The application of computer-aided design (CAD) tools in
the full custom design and testing of a 16-bit pipelined
two's complement multiplier in three micron NMOS is
described. A comparison between the full custom carry-save
addition (CSA) multiplier designed using CAD tools and a
multiplier generated by the MacPitts silicon compiler is
presented. Additional background material is also presented
on the CSA multiplication algorithm utilized.
I.  INTRODUCTION

With the ever increasing demand for extremely complex
integrated circuits, today's electrical engineers and
systems designers have to be knowledgeable in the design and
fabrication of Very Large Scale Integrated (VLSI) circuits.
Several approaches exist today for the design of VLSI
circuits. These approaches include the interconnection of
standard library cells, gate arrays, programmable logic
arrays, and full custom design. Full custom design is the
most time consuming and expensive of these approaches, but
generally yields a more efficient VLSI design in terms of
circuit density and speed of operation.

One methodology for full custom design that can be
easily understood and implemented by the systems designer
has been developed by Mead and Conway [Ref. 1]. This
methodology, coupled with the wide variety of computer-aided
design (CAD) tools that are available, makes it possible for
the systems designer to translate a design from a functional
block diagram, or a logic diagram, to silicon. Intelligent
simulation of the design prior to fabrication gives the
designer a high degree of confidence that the circuit
functions as desired, barring any unforeseen fabrication errors.

Another method that is available for the generation of
VLSI circuits is the use of a silicon compiler which takes
as input an algorithmic description of a circuit's desired
functions and generates the final layout of a VLSI circuit.
Using this approach to circuit design results in a rapid
design turn-around time. This gives the system designer
the ability to explore different architectures and find the
method best suited to solve a specific problem. One such
compiler that is installed and running at the Naval
Postgraduate School (NPS) is the MacPitts silicon compiler
developed at Massachusetts Institute of Technology's Lincoln
Laboratory. The installation and initial research on the
MacPitts compiler is documented in work done previously by
Carlson [Ref. 2]. Carlson utilized the MacPitts silicon
compiler to generate an 8-bit unsigned pipelined multiplier
to be used in a digital filter. To provide the basis for
comparison of a full custom design and a design generated by
the MacPitts silicon compiler, a 16-bit two's complement
multiplier in three micron NMOS was hand-crafted using CAD
tools currently available at NPS.

The discussion of a general carry-save addition (CSA)
multiplier follows in Chapter 2. Chapter 3 presents the
adaptation of the CSA multiplication scheme to the 16-bit
two's complement multiplier. The remainder of Chapter 3
contains the design and testing of the multiplier and a
description of the CAD tools utilized. Chapter 4 presents a
test plan for the VLSI circuit after its fabrication by the
MOS Implementation Service (MOSIS) of the Defense Advanced
Research Projects Agency. This is followed by a comparison
of the hand-crafted and MacPitts generated multipliers in
Chapter 5.


II.  UNSIGNED BINARY MULTIPLICATION


In this chapter, the implementation of an unsigned
binary parallel multiplier is described. First, a brief
discussion of the add-and-shift algorithm is presented.
Although almost every reference in digital arithmetic
contains a section on this algorithm (also called sequential
multiplication), it is given here so that terminology and
representations used in this chapter and the next may be
introduced. Next, a multiplication scheme utilizing
simultaneous generation of partial products followed by
simultaneous reduction using carry-save addition (CSA) is
described. The chapter concludes with a discussion of
implementing this parallel multiplication scheme as a
pipelined VLSI design.

A.  ADD-AND-SHIFT ALGORITHM

The basis for the multiplier design presented in this
chapter is the add-and-shift algorithm, which is similar to
the way one multiplies using pencil and paper. For example,
as shown in Figure 2.1, in multiplying two binary numbers
each bit of the multiplier requires a corresponding
add-and-shift operation.

A mathematical representation of the add-and-shift
algorithm for two n-bit numbers is given in Equation 2.1. This
equation has been derived from chapter 2 of Introduction to
Computer Architecture by Stone and others [Ref. 3].

$P = \sum_{i=0}^{n-1} 2^{i} a_i b$   (eqn 2.1)

Figure 2.1 Paper and Pencil Multiplication.

In this equation and throughout the remainder of this
thesis, concatenation implies the logical AND, the symbol +
implies the logical OR, b represents the n-bit multiplicand
vector, a_i represents bit i of the multiplier vector a, and P
represents the 2n-bit product vector. Figure 2.2 illustrates
this concept for the multiplication of two 8-bit
operands and Figure 2.3 introduces a convenient dot
representation of the same multiplication. As can be seen from
Figure 2.2, multiplying two 8-bit operands results in eight
partial products which are added to form a 16-bit final
product.
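
The add-and-shift algorithm of Equation 2.1 can be illustrated in software. The following is a minimal sketch, not part of the original design; the function name and the use of Python integers as bit vectors are assumptions made only for illustration.

```python
def add_and_shift_multiply(a: int, b: int, n: int) -> int:
    """Multiply two unsigned n-bit integers by summing shifted partial products.

    Mirrors Equation 2.1: P = sum over i of 2**i * a_i * b.
    """
    if not (0 <= a < (1 << n) and 0 <= b < (1 << n)):
        raise ValueError("operands must be unsigned n-bit values")
    product = 0
    for i in range(n):
        a_i = (a >> i) & 1           # bit i of the multiplier
        if a_i:
            product += b << i        # multiplicand weighted by 2**i
    return product
```

Each loop iteration corresponds to one add-and-shift operation, so an n-bit multiplier needs n sequential steps. This is the delay that the parallel scheme of the next section is intended to reduce.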
Figure 2.3 Dot Representation.

B.  SIMULTANEOUS MATRIX GENERATION AND REDUCTION

In terms of speed, the basic add-and-shift algorithm is
the slowest of the multiplication schemes. One method to
improve the speed of the basic sequential multiplier is to
perform as many operations as possible in parallel. This
method, known as the Simultaneous Matrix Generation and
Reduction method [Ref. 4: pp. 132-147], is composed of three
distinct steps. In the first step, all of the partial
products are simultaneously generated. In the next step, the
resultant matrix of partial products is reduced using
carry-save addition (CSA) until two vectors remain. Finally, the
two remaining vectors are added together to form the final
product.

1.  Partial Products Generation

The simplest way to generate each bit position of
the partial products is to use the logical AND operation as
a 1x1 multiplier. For example, in Figure 2.2, each of the
terms in the eight partial products is the result of a
logical AND operation and also corresponds to a single dot
in each of the partial products of Figure 2.3. For an n-bit
multiplication this scheme requires n x n AND gates, which is
a simple, but hardware intensive scheme.
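
As an illustration of the 1x1 multiplier scheme, the sketch below builds the matrix of partial product bits with one AND per bit position. The function names are hypothetical and the list-of-lists representation is an assumption chosen for readability, not the layout used on the chip.

```python
def partial_product_matrix(a: int, b: int, n: int) -> list[list[int]]:
    """Row i holds the n bits (a_i AND b_j) for j = 0..n-1, one AND gate each."""
    return [[((a >> i) & 1) & ((b >> j) & 1) for j in range(n)]
            for i in range(n)]


def matrix_value(rows: list[list[int]]) -> int:
    """Sum every bit at its weight 2**(i + j); equals the product a * b."""
    return sum(bit << (i + j)
               for i, row in enumerate(rows)
               for j, bit in enumerate(row))


rows = partial_product_matrix(0xB7, 0x5D, 8)
and_gate_count = sum(len(row) for row in rows)   # n * n gates for n = 8
```

For the 8-bit case the matrix holds 64 bits, matching the n x n gate count stated above.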

It is possible to use encoding techniques that will
reduce the number of partial products. One such method that
reduces the number of partial products by half is the
modified Booth's algorithm. For a description of both Booth's
original and modified algorithms, the reader is referred to
two presentations of these topics [Refs. 4,5: pp. 132-137,
152-157].

Another way to generate partial products is to use
read only memories (ROMs). For example, the 8x8
multiplication of Figure 2.2 can be implemented using four 256x8 ROMs
where each ROM performs a table lookup multiplication, as
shown in Figure 2.4.

In Figure 2.4, the 4-bit value of each element of
the pairs (Y0,X0), (Y0,X1), (Y1,X0), and (Y1,X1) is
concatenated to form an 8-bit address into the ROM table. The ROM
location corresponding to the address contains a unique
8-bit product. Thus four tables are required to
simultaneously form the products Y1xX1, Y1xX0, Y0xX1, and Y0xX0.
Note that the Y0xX0 and Y1xX1 terms have disjoint
significance, thus only three terms must be added to form the final
product. The number of rearranged partial products which
must be summed is referred to as the matrix height h. This
height corresponds to the number of initial inputs to the
CSA tree. A generalization of this scheme for up to a 64x64
bit multiplication is shown in Figure 2.5. Each rectangle
in Figure 2.5 [Ref. 4: p. 138] represents a 4x4 ROM
multiplier product.
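
The ROM lookup scheme described above can be sketched as follows. This is an illustrative model only: a single 256-entry Python list stands in for the four identical 256x8 ROMs, and the names ROM_4X4 and rom_multiply_8x8 are hypothetical.

```python
# 256-entry table addressed by two concatenated 4-bit operands;
# every entry is the 8-bit product of the two address halves.
ROM_4X4 = [(addr >> 4) * (addr & 0xF) for addr in range(256)]


def rom_multiply_8x8(y: int, x: int) -> int:
    """8x8 unsigned multiply from four 4x4 table lookups."""
    y1, y0 = y >> 4, y & 0xF
    x1, x0 = x >> 4, x & 0xF

    def lookup(yn: int, xn: int) -> int:
        return ROM_4X4[(yn << 4) | xn]      # concatenate into an 8-bit address

    # Y0xX0 occupies bits 0-7 and Y1xX1 occupies bits 8-15, so the two
    # products never overlap and can share one row of the matrix.
    outer = lookup(y0, x0) | (lookup(y1, x1) << 8)
    middle_a = lookup(y1, x0) << 4
    middle_b = lookup(y0, x1) << 4
    return outer + middle_a + middle_b      # three terms, matrix height h = 3
```

Because the two outer products are merged by a simple OR, only three vectors reach the adder tree, which is the matrix height of 3 listed for the 4x4 ROM scheme at 8 bits.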

Table I [Ref. 4: p. 139] summarizes the maximum
height of the partial products for the three partial product
generation schemes discussed in this section.

In the final design implemented in this thesis, the
partial products were generated using the 1x1 multiplier
(AND gate) method. This method was chosen over the other
two because of its simple and regular implementation.
Booth's algorithm was rejected as a choice due to the
complex nature of the control signals that are required.
The ROM partial product generation method was not chosen
because it would require 16 ROMs of 65536 x 16 bits to
simultaneously generate the 16 partial products needed in a
16-bit multiplier. Other possible combinations of different
size ROMs could also be used to generate the partial
products, but due to chip area and feature size limitations
imposed by MOSIS the ROM method of generating partial
products was rejected because it was not feasible to construct
on a single chip.

Figure 2.4 An 8x8 Multiplication Using ROMs.
Figure 2.5 ROM Multiplier Weighted Position Structure.

2.  Partial Products Reduction

Once the partial products are generated, the next
step is to reduce the n partial products down to two. One
technique that can be used to accomplish this is to utilize
3-input, 2-output full adders performing CSA in a Wallace
tree structure.

TABLE I

Matrix Height for Partial Product Generation Methods

                                         Max Height of the Matrix
                             General          Number of Bits
Scheme                       Formula      8   16   24   32   40   48   56   64

1 x 1 multiplier (AND gate)  n            8   16   24   32   40   48   56   64
4 x 4 multiplier (ROM)       (n/2) - 1    3    7   11   15   19   23   27   31
8 x 8 multiplier (ROM)       (n/4) - 1    1    3    5    7    9   11   13   15
Modified Booth's algorithm   (n/2)        4    8   12   16   20   24   28   32

The partial products for the 8x8 multiplication
represented by Figure 2.3 can be viewed as adjacent columns
of height h, where each column corresponds to all terms to
the same power of 2, as shown in the Wallace tree structure
of Figure 2.6.

Figure 2.6 Partial Products in Wallace Tree Structure.

To reduce these columns of height h, CSA is used to
reduce three dots of column height to two dots. These two
output dots, which represent the familiar sum and carry
outputs of a full adder, are placed in the next level of the
tree structure in their appropriate power positions. In
general, the number of levels (L) of CSA required
to reduce a Wallace tree structure of column height h to two
is given by Equation 2.2 [Ref. 4: p. 139]. L can also be
viewed as the minimum number of full adder delays required
to produce the pair of column operands. For an 8x8
multiplication, the maximum column height is h=8. Thus, four
levels of CSA are required as illustrated in Figure 2.7
[Ref. 4: p. 141].

^  (egn  2.2) 


Table II [Ref. 4: p. 139] shows the number of carry-save
adder levels corresponding to various column heights.
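
The reduction process and the level counts of Table II can be checked with a short sketch. It is an illustration under stated assumptions: partial products are modelled as whole rows (Python integers) rather than individual dots, the level count uses a simple three-rows-to-two recurrence rather than the closed form of Equation 2.2, and all function names are hypothetical.

```python
def csa(x: int, y: int, z: int) -> tuple[int, int]:
    """Carry-save add three vectors: bitwise sum and carries moved up one column."""
    return x ^ y ^ z, ((x & y) | (x & z) | (y & z)) << 1


def wallace_reduce(rows: list[int]) -> tuple[list[int], int]:
    """Apply CSA levels until at most two vectors remain; also return the level count."""
    levels = 0
    while len(rows) > 2:
        nxt = []
        for k in range(0, len(rows) - 2, 3):          # every full group of three rows
            nxt.extend(csa(rows[k], rows[k + 1], rows[k + 2]))
        nxt.extend(rows[len(rows) - len(rows) % 3:])  # leftover rows pass through
        rows, levels = nxt, levels + 1
    return rows, levels


def csa_levels(h: int) -> int:
    """Levels needed for height h: each level turns every three rows into two."""
    levels = 0
    while h > 2:
        h -= h // 3
        levels += 1
    return levels


a, b = 0xB7, 0x5D
partials = [(b << i) if (a >> i) & 1 else 0 for i in range(8)]
final_rows, used_levels = wallace_reduce(partials)
```

For the 8x8 case the sketch uses four levels and leaves two vectors whose ordinary sum equals the product, in agreement with the text above.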

3.  Carry  Look-Ahead  Addition 

The final step in this multiplication scheme is to
sum the two remaining vectors created by the CSA reduction
scheme discussed in the previous section. The major
consideration in the choice of addition methods for the final
summation is speed of operation. One method that
significantly reduces the number of gate delays and increases the
speed over ripple carry addition is carry lookahead (CLA)
addition. Rather than give a full derivation of the CLA
addition concept [Ref. 5: pp. 84-91], the basic operation is
presented for the 32-bit CLA adder that is used in the final
design implemented in this thesis.

Figure 2.8 represents the designed 32-bit CLA adder
which can be thought of as operating in three steps. First,
the two input vectors X and Y to be summed are broken into
4-bit blocks. These blocks are routed into a circuit called
a block P & G generator. The block P & G generator looks at
each 4-bit block from X and Y to determine if a carry into
the least significant bit position will propagate to the
carry out of the most significant bit position of the block,
and if the block generates a carry out on its own.
The logic equations for these two signals, called block
propagate (Pn) and block generate (Gn) respectively, are
given in Equations 2.3 and 2.4 for a block whose most
significant bit is position n. Equations 2.3 through 2.15 are
derived from [Ref. 5: pp. 84-91].
Figure 2.7 CSA Reduction for an 8-bit Multiplication.
TABLE II

Levels of CSA Needed vs. Maximum Column Height

Column Height (h)    Number of Levels (L)

        3                     1
        4                     2
    4 < h ≤ 6                 3
    6 < h ≤ 9                 4
    9 < h ≤ 13                5
   13 < h ≤ 19                6
   19 < h ≤ 28                7
   28 < h ≤ 42                8
   42 < h ≤ 63                9

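$P_n = (x_n + y_n)(x_{n-1} + y_{n-1})(x_{n-2} + y_{n-2})(x_{n-3} + y_{n-3})$   (eqn 2.3)

$G_n = x_n y_n + (x_n + y_n) x_{n-1} y_{n-1} + (x_n + y_n)(x_{n-1} + y_{n-1}) x_{n-2} y_{n-2} + (x_n + y_n)(x_{n-1} + y_{n-1})(x_{n-2} + y_{n-2}) x_{n-3} y_{n-3}$   (eqn 2.4)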


Next, the block P and G signals are input into a CLA
unit that generates the true carry Cn out of the next least
significant block, that is, the block whose most significant
bit is position n-1. For a 32-bit addition, two CLA
units are required. The equations for the lower order CLA
unit are given in Equations 2.5, 2.6, 2.7, and 2.8.

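$C_4 = G_3 + P_3 C_0$   (eqn 2.5)

$C_8 = G_7 + P_7 G_3 + P_7 P_3 C_0$   (eqn 2.6)

$C_{12} = G_{11} + P_{11} G_7 + P_{11} P_7 G_3 + P_{11} P_7 P_3 C_0$   (eqn 2.7)

$C_{16} = G_{15} + P_{15} G_{11} + P_{15} P_{11} G_7 + P_{15} P_{11} P_7 G_3 + P_{15} P_{11} P_7 P_3 C_0$   (eqn 2.8)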

Since in a multiplication of two numbers the carry into the
least significant bit position is zero, the above four
equations reduce to Equations 2.9, 2.10, 2.11, and 2.12.


$C_4 = G_3$   (eqn 2.9)

$C_8 = G_7 + P_7 G_3$   (eqn 2.10)

$C_{12} = G_{11} + P_{11} G_7 + P_{11} P_7 G_3$   (eqn 2.11)

$C_{16} = G_{15} + P_{15} G_{11} + P_{15} P_{11} G_7 + P_{15} P_{11} P_7 G_3$   (eqn 2.12)

Figure 2.8 Block Diagram of a 32-bit CLA Adder.

Similarly, the equations for the upper CLA unit are given as
Equations 2.13, 2.14, and 2.15.

$C_{20} = G_{19} + P_{19} C_{16}$   (eqn 2.13)

$C_{24} = G_{23} + P_{23} G_{19} + P_{23} P_{19} C_{16}$   (eqn 2.14)

$C_{28} = G_{27} + P_{27} G_{23} + P_{27} P_{23} G_{19} + P_{27} P_{23} P_{19} C_{16}$   (eqn 2.15)

Note that the carry out of the most significant bit is
disregarded. This is because the result of multiplying two
16-bit operands yields only a 32-bit result.

Finally, the carry signals generated by the previous
two steps are added in 4-bit block ripple carry adders with
their appropriate slices of Y and X to form the 32-bit sum.
Note that the carry out of each 4-bit ripple carry adder is
disregarded, as it was generated and used previously.
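
The three steps of the CLA adder can be modelled in a few lines. This is a behavioural sketch only, assuming the standard block propagate and generate definitions of Equations 2.3 and 2.4; the function names are hypothetical, and the 4-bit ripple carry adders are modelled with integer addition rather than individual full adders.

```python
MASK32 = (1 << 32) - 1


def block_pg(x: int, y: int, msb: int) -> tuple[int, int]:
    """Block propagate and generate for the 4-bit block whose top bit is msb."""
    p, g = 1, 0
    for i in range(msb - 3, msb + 1):            # block LSB up to block MSB
        xi, yi = (x >> i) & 1, (y >> i) & 1
        g = (xi & yi) | ((xi | yi) & g)          # generate here, or pass a lower generate up
        p &= xi | yi                             # every position must propagate
    return p, g


def cla_unit(pg: list[tuple[int, int]], c_in: int) -> list[int]:
    """Carry out of each block, built from expanded lookahead product terms."""
    carries = []
    for k in range(len(pg)):
        carry, prefix = 0, 1
        for p, g in reversed(pg[:k + 1]):        # G_k, then P_k G_(k-1), and so on
            carry |= prefix & g
            prefix &= p
        carries.append(carry | (prefix & c_in))  # last term: all propagates AND carry-in
    return carries


def cla_add32(x: int, y: int) -> int:
    pg = [block_pg(x, y, msb) for msb in range(3, 32, 4)]   # eight 4-bit blocks
    lower = cla_unit(pg[:4], 0)                  # C4, C8, C12, C16 with C0 = 0
    upper = cla_unit(pg[4:7], lower[3])          # C20, C24, C28; carry out of bit 31 unused
    total = 0
    for blk, c in enumerate([0] + lower + upper):
        xs, ys = (x >> (4 * blk)) & 0xF, (y >> (4 * blk)) & 0xF
        total |= ((xs + ys + c) & 0xF) << (4 * blk)          # block carry-out discarded
    return total
```

The split into a lower unit with a zero carry-in and an upper unit fed by C16 follows the two-unit arrangement described above.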

C.  PIPELINED ADAPTATION

In the previous section, the implementation of a
parallel CSA multiplier was described. This method can
logically be partitioned into stages for realization as a
pipelined design.

In pipelining any design or algorithm, the basic
objective is to introduce concurrency by taking the function to
be performed and partitioning it into several subfunctions.
The following properties [Ref. 6: p. 4] are important to
consider when pipelining a design:

1.  Evaluation of the basic function is equivalent to some
sequential evaluation of the subfunctions.

2.  The inputs for one subfunction come totally from the
outputs of the previous subfunction in the evaluation
sequence.

3.  Other  than  the  exchange  of  inputs  and  outputs,  there 
are  no  interrelationships  between  subfunctions. 

4.  Hardware  can  be  developed  to  execute  each  subfunction. 

5.  The  times  required  for  these  hardware  units  to  perform 
their  individual  evaluations  are  usually  approximately 
equal. 

The hardware required to perform each subfunction of a
pipeline is called a stage. At the output of each stage is
a latch that is used to perform the actual exchange of
operands between stages.

To partition the CSA multiplier into its stages, a
logical division of the subfunctions to be executed must be
determined. One method that initially may come to mind is
to make the partial product reduction scheme using the
Wallace tree structure as one stage of the pipeline and the
CLA addition as a second stage. This was rejected because
for a 16-bit multiply, the first stage would require six
full adder delays and an AND gate delay before being ready
to be latched. In the second stage, the CLA adder would
require the delay for the P and G generation, the true carry
generation in the CLA unit, and four full adder delays
before being ready to be latched.

The next partitioning of subfunctions went one level
further into defining each stage. The CLA adder was further
subdivided into three subfunctions. The first stage
performs the generation of the P and G signals based on the
two 32-bit input vectors. The next stage uses the P and G
signals generated in the previous stage to produce the true
carry signals. In the third and final stage of the CLA
adder, the 4-bit blocks are summed with their appropriate
carry in signals generated in the previous stage to form the
final product. In looking at the CLA adder portion, the
longest delay occurs in the final stage. This delay has a
magnitude of 4 full adder delays and it is this figure that
is used to partition the Wallace tree reduction scheme into
stages.

For a 16-bit multiplication, the maximum height of the
Wallace tree is sixteen as shown in Table I. This maximum
height requires six levels of CSA addition (see Table II)
before a column height of two is obtained to be input into
the CLA adder. Also to be performed in this stage is the
generation of each bit of the partial products through the
use of AND gates. Starting at the beginning of the Wallace
tree structure and keeping the stage delay at less than the
four full adder delays of the CLA adder, the 1x1 multiply
and three levels of CSA can be accomplished in the first
stage of the pipeline. This leaves the next stage of the
pipeline with the remaining three levels of CSA to perform
before going into the 32-bit CLA adder for the generation of
the final product. Figure 2.9 shows each stage of the
pipeline and its subfunction. This pipelined structure is to be
the one implemented in the final design of this thesis with
adaptations to allow for the implementation of a two's
complement multiplier.
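
The five-stage partitioning described above can be exercised with a small behavioural model. The sketch below is an assumption-laden illustration, not the thesis circuit: it handles unsigned 16-bit operands only, models each latch as a Python value updated once per clock, and writes the stage 4 carry logic as a recurrence where the hardware would use the flattened lookahead equations. All names are hypothetical.

```python
def csa_level(rows):
    """One CSA level: every group of three rows becomes a sum row and a carry row."""
    out = []
    for k in range(0, len(rows) - 2, 3):
        x, y, z = rows[k:k + 3]
        out.append(x ^ y ^ z)
        out.append(((x & y) | (x & z) | (y & z)) << 1)
    out.extend(rows[len(rows) - len(rows) % 3:])     # leftover rows pass through
    return out

def stage1(operands):                                # 1x1 multiplies + 3 CSA levels
    a, b = operands
    rows = [(b << i) if (a >> i) & 1 else 0 for i in range(16)]
    for _ in range(3):
        rows = csa_level(rows)                       # 16 -> 11 -> 8 -> 6 rows
    return rows

def stage2(rows):                                    # remaining 3 CSA levels
    for _ in range(3):
        rows = csa_level(rows)                       # 6 -> 4 -> 3 -> 2 rows
    return rows[0], rows[1]

def stage3(xy):                                      # block P and G generation
    x, y = xy
    pg = []
    for blk in range(8):
        xs, ys = (x >> 4 * blk) & 0xF, (y >> 4 * blk) & 0xF
        pg.append((int((xs | ys) == 0xF), (xs + ys) >> 4))
    return x, y, pg

def stage4(state):                                   # true carry into each block
    x, y, pg = state
    carries = [0]
    for p, g in pg[:-1]:
        carries.append(g | (p & carries[-1]))
    return x, y, carries

def stage5(state):                                   # 4-bit block sums
    x, y, carries = state
    total = 0
    for blk, c in enumerate(carries):
        xs, ys = (x >> 4 * blk) & 0xF, (y >> 4 * blk) & 0xF
        total |= ((xs + ys + c) & 0xF) << (4 * blk)
    return total

STAGES = [stage1, stage2, stage3, stage4, stage5]

def run_pipeline(stages, inputs):
    """Clock a linear pipeline: one latch per stage, one new input per cycle."""
    latches = [None] * len(stages)                   # None marks an empty latch
    results = []
    for item in list(inputs) + [None] * len(stages): # trailing bubbles flush the pipe
        for k in range(len(stages) - 1, 0, -1):      # last stage first: read old values
            prev = latches[k - 1]
            latches[k] = None if prev is None else stages[k](prev)
        latches[0] = None if item is None else stages[0](item)
        if latches[-1] is not None:
            results.append(latches[-1])
    return results
```

Once the pipe is full, one product leaves the final latch on every clock, while each individual product still takes five clocks to pass through.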




III.  DESIGN: 16-BIT TWO'S COMPLEMENT MULTIPLIER

A.  TWO'S COMPLEMENT MULTIPLIER

1.  Theoretical Architecture

The multiplication of two 16-bit signed numbers
represented in two's complement form can be performed
through the implementation of Equation 3.1 [Ref. 3] where n
equals sixteen. In Equation 3.1, the notation b' denotes
the one's complement of the multiplicand.


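$P = \sum_{i=0}^{n-2} 2^{i} a_i b - 2^{n-1} a_{n-1} b$

$\;\; = \sum_{i=0}^{n-2} 2^{i} a_i b + 2^{n-1} a_{n-1} (b' + 1)$   (eqn 3.1)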


Each partial product generated through the use of Equation
3.1 is summed with the remaining partial products as in the
unsigned CSA multiplier discussed in the previous chapter
with two exceptions. First, each partial product must have
its most significant bit extended to the most significant
bit of the final product. In the design used in this thesis
for 16-bit operands, the most significant bit of each
partial product must be extended to bit position 31.
Second, the most significant bit of the multiplier must be
added into bit position 15. This insertion of the most
significant bit of the multiplier can also be accomplished
by inserting it twice into the final summation at bit
position 13 and once into each of the bit positions 14 and 15.
This is done in the final design of this multiplier to keep
the maximum column height to be input to the Wallace tree
reduction scheme at sixteen. Figure 3.1 demonstrates the
use of this equation directly on the multiplication of two
4-bit two's complement numbers where n equals four.
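
The two's complement scheme can be checked numerically. The sketch below assumes the usual form of the identity, in which the partial product for the multiplier's sign bit uses the one's complement b' plus one, and it inserts the sign bit directly at position n-1 rather than redistributing it across lower columns. The function name and the bit-pattern interface are hypothetical.

```python
def twos_complement_multiply(a: int, b: int, n: int) -> int:
    """Multiply two n-bit two's complement bit patterns; returns the 2n-bit pattern.

    a and b are given as unsigned patterns in the range 0 .. 2**n - 1.
    """
    mask_n = (1 << n) - 1
    mask_2n = (1 << (2 * n)) - 1

    def sign_extend(v: int) -> int:
        # Copy bit n-1 of an n-bit pattern into bits n .. 2n-1.
        return v | (mask_2n & ~mask_n) if (v >> (n - 1)) & 1 else v

    b_ext = sign_extend(b)                   # multiplicand, sign-extended
    b_comp = sign_extend(~b & mask_n)        # one's complement b', sign-extended
    total = 0
    for i in range(n - 1):                   # ordinary partial products, bits 0 .. n-2
        if (a >> i) & 1:
            total += b_ext << i
    if (a >> (n - 1)) & 1:                   # sign bit of the multiplier
        total += (b_comp << (n - 1)) + (1 << (n - 1))   # 2**(n-1) * (b' + 1)
    return total & mask_2n
```

For example, with n = 4 the patterns 0b1101 (-3) and 0b0110 (6) give the 8-bit pattern for -18.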
Figure 3.1 Two's Complement Multiplication.


Figure 3.2 shows, in dot notation, the partial
products generated with 1x1 multipliers using Equation 3.1 with
the two exceptions discussed above for a 16-bit two's
complement multiplication. It is this structure that is
input into the Wallace tree reduction scheme to be reduced
to a final maximum column height of two. Since the maximum
column height is sixteen for the 16-bit two's complement
multiplication presented in this thesis, six levels of CSA,
as shown in Figures 3.3 and 3.4, are required to decompose
this structure to a maximum column height of two. The
resulting two vectors generated by the CSA are then input
into the CLA adder presented in the previous chapter.

One interesting point to note is that the column
height for certain columns is only one. This is caused when
CSA is performed on three or fewer operands in a column and
no carry into that column is produced by the next lower
significant one. In these operand vectors, a zero is input
for the appropriate bit position into the CLA adder.

To perform this multiplication in a pipelined
manner, latches must be inserted at the end of each stage of
the pipeline as discussed earlier. Since the first stage
involves a 1x1 multiplication to generate the partial
products and three levels of CSA, the first latch must be
inserted at the end of the third level of CSA. At this
point, 143 bits of data must be transferred to the second
stage. Therefore, the first latch is 143 bits wide.
Similarly, the second stage ends after the sixth level of
CSA is performed. This requires the second latch to be 57
bits wide. These 57 bits are then input to the CLA adder.
The third stage of the circuit generates the block P and G
signals. These signals and the 57 bits of the two CLA
operands are then transferred to the fourth stage in a 70 bit
wide latch. The fourth stage uses the P and G signals to
generate the true carry signals to be used in the fifth and
final stage. This requires a 64 bit latch at its output to
hold the carry signals and the two CLA operand vectors. The
final product appears at the output of the fifth stage and
is stored in a 32 bit wide latch so that latched outputs can
be provided to any subsequent circuits that this multiplier
may drive.

Figure 3.4 Partial Product Reduction Using CSA (cont'd.).

2.  Actual Implementation

The initial floorplan for the circuit is shown in
Figure 3.5. This floorplan closely follows the theoretical
implementation with two exceptions.

First, in a VLSI design, an AND gate used as a 1x1
multiplier is implemented with a NAND gate followed by an
inverter. This active-high signal is then input to an
active-high input, active-high output full adder in the
first level of CSA. Rather than construct these two circuit
elements in this manner, the actual implementation utilized
a NAND gate as the 1x1 multiplier driving an active-low
input, active-high output full adder. Any signal generated
with a NAND gate as a partial product bit that is not used
in the first level of CSA is simply routed through an
inverter to convert it to an active-high signal for use in
subsequent levels of CSA. This provided a reduction of 256
in the number of inverters to be constructed.

Second, the sign bits of each of the partial
products must be extended to bit position thirty-one. These
extended bits must also be added in the Wallace tree
reduction of the partial products. When these sign bits are
grouped for input to a full adder in the first level, up to
fourteen adders have the same three inputs. Rather than
duplicate the adders which would increase power consumption
and usage of chip area, only one adder was used to calculate
the sum and carry inputs to the next level of CSA. These
high fanout sum and carry inputs are then superbuffered to
drive the second level of CSA. This resulted in a savings
of thirty-five full adders not having to be implemented in
silicon.

Figure 3.5 Initial Floorplan.

The clocking of the circuit is accomplished by a
non-overlapping two-phase clock. Both phases are input to
the circuit through separate input pads. An additional
signal called OP is provided to allow for the implementation
of a level sensitive scan design (LSSD) [Ref. 7]. In an
LSSD, the contents of the latches are either loaded in
parallel when OP is high or serially shifted to an output
pad and serially loaded from an input pad when OP is low.
This allows the contents of each of the first four latches
to be examined to aid in the detection of fabrication errors
or circuit malfunctions. The output latch is not serially
loaded or shifted to an output pad because its contents are
directly available at the output pads.
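
The two operating modes selected by OP can be illustrated with a small behavioural model. This is a sketch under stated assumptions: the class name, the single clock method standing in for the two-phase clock, and the direction of the serial shift are all hypothetical choices made for illustration, not details taken from the design.

```python
class ScanLatch:
    """Pipeline latch with a parallel-load mode and a serial scan mode."""

    def __init__(self, width: int):
        self.bits = [0] * width

    def clock(self, op: int, parallel_in=(), serial_in: int = 0):
        """One clock cycle.

        op = 1: load every bit from parallel_in (normal pipeline operation).
        op = 0: shift one position; serial_in enters at index 0 and the bit
                at the far end is returned as the serial output.
        """
        if op:
            if len(parallel_in) != len(self.bits):
                raise ValueError("parallel input width mismatch")
            self.bits = list(parallel_in)
            return None
        serial_out = self.bits[-1]
        self.bits = [serial_in] + self.bits[:-1]
        return serial_out


latch = ScanLatch(4)
latch.clock(1, parallel_in=[1, 0, 1, 1])                    # capture stage outputs
observed = [latch.clock(0, serial_in=0) for _ in range(4)]  # scan contents out
```

Scanning a latch out in this way exposes an internal pipeline value at a pad, and shifting a known pattern in allows a later stage to be driven independently of the stages before it.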

B.  DESIGN TOOLS

Before the actual layout of a VLSI circuit can be
undertaken, certain CAD tools are needed by the designer. First,
a graphical layout editor is required to allow the designer
to construct a VLSI circuit. Second, to allow for the
implementation of complex logic functions, a PLA generator
is desired. Next, the ability to employ a design rule
checker on a layout is essential to ensure that design rule
violations do not unintentionally occur. Finally, tools
that perform circuit simulation for logic, timing, and power
consumption are useful in determining the proper operation
of the designed circuit.

In the design of the 16-bit pipelined multiplier, the
CAESAR layout editor [Refs. 7,8] was used as the basis for
the layout of the entire chip. To facilitate the design of
complex logic functions, EQNTOTT [Ref. 9] and TPLA [Ref. 9]
were employed to construct complex programmed logic arrays
(PLAs). LYRA [Ref. 9] was used to perform design rule
checks on the circuit. Circuit simulation for logic,
timing, and power was performed by ESIM [Refs. 2,9],
CRYSTAL [Refs. 10,11] and POWEST [Ref. 9] after a node
extraction was performed using MEXTRA [Ref. 9].

The manuals for each of the CAD tools discussed above
are available on the NPS Computer Science Department's UNIX
operating system. To obtain an on-line copy of the manual
for a specific design tool, issue the command


To obtain a hardcopy of a certain CAD tool manual, issue the
command


% cadman <design tool name> | lpr.

This command will send a copy of the normal CAD manual to
the lineprinter.

1.  EQNTOTT

EQNTOTT is a program which generates a truth table
suitable for input to TPLA from a set of Boolean equations
which define the PLA outputs in terms of its inputs. The
equation syntax is

NAME = EXPRESSION;

where NAME is the output variable name and EXPRESSION is a
Boolean equation in sum of products (SOP) form that
represents the output variable in terms of its inputs. In the
SOP expression, the & symbol denotes the logical AND, the |
symbol denotes the logical OR, and the ! symbol preceding
an operand denotes the logical inversion. The input and
output signal order, from left to right or top to bottom, as
appropriate, can be controlled with the INORDER and OUTORDER
commands.
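
The idea of turning Boolean equations into a truth table can be sketched in a few lines. This is a conceptual illustration only: it does not reproduce the file format that EQNTOTT reads or writes, the function name is hypothetical, and the full adder equations are a sample input chosen here rather than taken from the design.

```python
from itertools import product


def truth_table(inputs, outputs):
    """Enumerate all input combinations and evaluate each output equation.

    inputs  : input names, in the order the columns should appear
    outputs : mapping of output name to a function of the named inputs
    Returns a list of (input_bits, output_bits) strings, one per row.
    """
    rows = []
    for values in product((0, 1), repeat=len(inputs)):
        env = dict(zip(inputs, values))
        out_bits = [int(bool(fn(**env))) for fn in outputs.values()]
        rows.append(("".join(map(str, values)), "".join(map(str, out_bits))))
    return rows


# Sum-of-products equations for a full adder with inputs a, b, c.
full_adder = {
    "sum": lambda a, b, c: (a and not b and not c) or (not a and b and not c)
                           or (not a and not b and c) or (a and b and c),
    "carry": lambda a, b, c: (a and b) or (a and c) or (b and c),
}

table = truth_table(["a", "b", "c"], full_adder)
```

The order of the input list and of the output dictionary plays the role that the input and output ordering commands play in the tool.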

2.  TPLA

TPLA is a technology independent PLA generator that
supports design rules in the following styles:

1.  Mead-Conway NMOS with butting contacts, no buried
contacts.

2.  Mead-Conway NMOS with buried contacts, no butting
contacts.

3.  MOSIS 3 micron bulk CMOS

It takes as its input the output of EQNTOTT and generates a
PLA layout in the desired technology. The default output
option is a CAESAR file. TPLA can provide inputs and
outputs on either the same side (cis version) or on opposite
sides (trans version) of the generated PLA. In addition,
clocked inputs and/or outputs can be supported by TPLA
through another option selection.

3. LYRA

LYRA is a design rule checker that operates on
graphical files in CAESAR format. It can be invoked either
interactively while editing a CAESAR file or on a CAESAR
file and run in the background on the UNIX operating system.
The interactive mode is discussed in earlier work done by
Reid [Ref. 7]. In the background mode, LYRA is invoked by
executing the command

lyra filename.ca &

This generates a file named CHECKPT which contains the names
of all subcells of the design being checked that have
completed a design rule check. If an error is found in the
parent cell or any of its subcells, a file with the same
name and filetype .ly is output to the user's current working
directory. This file contains all error information and can
be edited using CAESAR to view the errors for further
correction. This mode of operation for LYRA provides an
excellent means for design rule checking large designs that
normally would take a long time in the interactive mode.

C.  LAYOUT 

Once the designer has determined the architecture to be
implemented, the initial floorplan, and has mastered the CAD
tools that are available, the next step in the design cycle
is to begin the layout of the actual circuit. One technique
that is utilized in this design of a 16-bit pipelined
multiplier is a form of the hierarchical design method. In this
method, once the above three items are completed, the
architecture is examined to look for some basic building blocks
that could be designed and used repeatedly in the
construction of the circuit. Upon examination of the architecture
for the 16-bit pipelined multiplier, the four basic circuit
elements that can be designed and iterated throughout the
circuit are a full adder, a 4-bit block P and G generator, a
CLA unit, and a 1-bit latch cell.

The full adder is the main element in both of the first
two stages in the pipeline as well as a basic building block
for the 4-bit ripple carry adders in the fifth stage. The
first two methods of implementation that immediately arise
are constructing an adder by using either discrete gates or
a PLA generator such as TPLA. A third method [Ref. 12] that
is possible is to use pass transistors in a selector logic
circuit to generate the sum and carry bits that are
conditioned on the three input bits to be added.
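
The idea of selecting the sum and carry from conditions on the inputs can be illustrated with a small behavioral model. This is a sketch under the assumption of a common multiplexer-style formulation, in which two inputs select how the third is routed. It is not taken from Ref. 12 and does not describe the transistor-level circuit.

```python
from itertools import product

def selector_full_adder(a, b, c):
    """Multiplexer-style full adder: a and b select how c is passed through.

    Returns (sum, carry). Illustrative model only.
    """
    if a != b:
        # Inputs differ: carry follows c, sum is the complement of c.
        return 1 - c, c
    # Inputs agree: sum follows c, carry equals a (and b).
    return c, a

# Check every input combination against ordinary binary addition.
for a, b, c in product((0, 1), repeat=3):
    s, carry = selector_full_adder(a, b, c)
    assert 2 * carry + s == a + b + c
```

In this form no gate computes the sum directly. The conditions on a and b only steer c or its complement to the outputs, which is the role pass transistors play in a selector circuit.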

In choosing the adder to be implemented, the two main
considerations are its speed and power consumption. Both the
discrete gate and the PLA adders have a higher static power
consumption than the selector adder because they contain more
depletion pull-up transistors than the selector adder. After
simulation of these circuits for speed using CRYSTAL, it was
found that the selector circuit, with a 14.7 nanosecond
propagation delay, was faster than both of the other two by at
least two nanoseconds. Therefore, the selector adder was chosen
as one of the basic building blocks of the circuit. Figure 3.6
shows a circuit diagram of the selector adder used in the
design of the 16-bit multiplier. Two minor drawbacks exist
to the selection of this type of adder. When the output of
one adder drives the input of another, this is equivalent to
the output of a pass transistor driving an inverter. To
ensure that the following adder inputs are driven to the
necessary voltage levels to operate properly, the input
inverters to each vertical selector rail must have a pull-up
to pull-down ratio of eight. Also, the selector rail that
provides the true signal to the circuit must pass through
two inverters. This prevents the output of a pass
transistor in the previous adder from directly driving the gate
of  a  pass  transistor  in  the  current  adder  [Ref.  1:  pp. 


Both the 4-bit block P and G generator and the CLA unit
are complex logic functions well-suited for implementation
as PLAs. These two circuit elements are implemented by
inputting Equations 2.3 and 2.4 (for the P and G generator)
and Equations 2.9 to 2.15 (for the CLA unit) into EQNTOTT.
The output of EQNTOTT is then piped to TPLA to generate the
actual CAESAR files for the PLAs. Since data flows into one
side and out from the opposite side of each stage, the trans
version of the PLAs was constructed.

The last building block of the circuit to be designed is
the 1-bit latch cell. Since LSSD is an important
criterion for designing the 16-bit multiplier, the 1-bit
latch cell must be able to be loaded either in parallel
along the data path or in serial from an adjacent latch
cell. This function is under control of the OP signal.

To minimize the area consumed by the latch, a dynamic
latch composed of a pair of inverters coupled by pass
transistors was selected. As in the adder circuit, a pull-up to
pull-down ratio of eight is needed for the inverters because
they are driven by pass transistors. Figure 3.7 shows the
circuit diagram of the 1-bit latch cell as implemented. The
operation of the latch cell is as follows. For normal
operation (OP=1), the NORMAL signal is high and the SHIFT signal
is low during PHI1. Data appearing at the DATA IN port
drives the first inverter. When PHI1 falls, the gate of the
first inverter retains the logic value of DATA IN in its
gate capacitance. When PHI2 rises, this data drives the
second inverter which effectively transfers the data to DATA
OUT and the next stage. For a shifting operation (OP=0),
the NORMAL signal is low and the SHIFT signal is high. Data
appearing at the LATCH IN port, which connects to DATA OUT
of the next latch cell to the left, charges the gate
capacitance of the first inverter. The pass transistor transfers
the data to the second inverter on PHI2 as in a normal
operation. This effectively shifts the data from the LATCH
IN port to the LATCH OUT port in one cycle of the clock.
Figure 3.8 shows the circuitry to condition PHI1 with OP to
generate the NORMAL and SHIFT signals used above.

Figure 3.7 1-bit Latch Cell.
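
The two modes of the latch cell can be summarized with a small behavioral model. This is a logic-level sketch only: the class and function names are hypothetical, signal inversion through the two inverters is ignored, and the charge-storage behavior of the dynamic latch is reduced to a stored bit.

```python
class LatchCell:
    """Behavioral model of one latch cell with a serial (shift) path."""

    def __init__(self):
        self.first = 0   # value held on the gate of the first inverter
        self.out = 0     # value presented at DATA OUT / LATCH OUT

    def phi1(self, op, data_in, latch_in):
        # OP=1 (NORMAL high): load from the data path.
        # OP=0 (SHIFT high): load from the neighboring latch cell.
        self.first = data_in if op == 1 else latch_in

    def phi2(self):
        # The second pass transistor transfers the stored value onward.
        self.out = self.first


def clock(cells, op, data, serial_in):
    """Apply one full PHI1/PHI2 cycle to a chain of cells, left to right."""
    # Sample every left neighbor's output before any cell updates.
    neighbors = [serial_in] + [c.out for c in cells[:-1]]
    for cell, d, n in zip(cells, data, neighbors):
        cell.phi1(op, d, n)
    for cell in cells:
        cell.phi2()
    return [c.out for c in cells]


cells = [LatchCell() for _ in range(4)]
print(clock(cells, 1, [1, 0, 1, 1], 0))   # parallel load along the data path
print(clock(cells, 0, [0, 0, 0, 0], 0))   # serial shift by one position
```

With OP=1 the chain captures the parallel data. With OP=0 each cell takes the value of its left neighbor, so the stored word moves one position per clock cycle.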


Once  these  four  basic  building  blocks  are  designed,  each 
stage  of  the  pipeline  and  its  latch  is  developed  out  of  the 
appropriate  subcells.  Next,  the  internal  routing  of  signals 
within a stage is accomplished through the use of a wire
list.  Then  the  five  stages  of  the  circuit  are  wired 
together  to  form  the  core  of  the  design.  Finally,  all  that 
remains  to  be  done  is  to  connect  this  core  design  to  a  frame 
to  allow  adequate  interfacing  for  the  packaging  process. 

This routing of signals both within the core of the
design and to the frame is an extremely time consuming task
that requires as much time, effort, and planning as the
design and layout of all the major components. An automatic
router would be a welcome addition to any designer's CAD
toolbag.

The design frame is composed of a pad set that was
obtained from MOSIS. These pads were specifically designed
for fabrication at 1.5 microns per lambda. A copy of these
pads is located in the file

/vlsi/berk83/lib/pads15.cif

and associated documentation can be found in the file

/vlsi/berk83/doc/pads15

Both of these files are located on the NPS Computer Science
Department's VAX 11/780 running the UNIX operating system.

Numerous repetitions of the design - rule check -
redesign cycle occurred before a final design was obtained.
Using LYRA for the design rule check on a large design such
as the 16-bit multiplier requires approximately 1000 CPU
minutes. When the UNIX system is heavily loaded, this
results in a turn-around time on the order of two to three
days. Figure 3.9 depicts the final design of the entire
chip. Each of the six levels of CSA is shown as level1
through level6. The latches are labelled latchxx where xx
is the appropriate number of bits in the latch. The block P
and G generators are designated PG and the CLA unit is
simply shown as CLA. The 4-bit ripple carry adders are
shown as ADD. Three blocks not previously discussed are
labelled AMP. These are control line drivers that drive the
high fanout NORMAL, SHIFT, and PHI2 signals to each of the
latches. These drivers are composed of the same circuitry
used by the output pads to drive off chip loads.


D. DESIGN VALIDATION


The next step in the design cycle is to functionally
validate the chip's operation before it is sent to MOSIS for
fabrication. This will give the designer a high degree of
certainty that the chip operates logically as desired with
an approximate power consumption and at a certain maximum
frequency of operation.

Before these three items can be accomplished, two
preliminary steps must be completed. First, the CAESAR
file must be edited to label the nodes and a Caltech
Intermediate Format (CIF) file generated. For the purpose
of performing design validation using CAD tools, the scale
of centimicrons per lambda must be an even multiple of four.
This prevents round-off errors in the resultant CIF file.
Since the final design is to be fabricated at lambda equals
1.50 microns, 152 centimicrons per lambda is used. Second,
the CIF file must be passed through the MEXTRA program using
the command

% mextra -o filename.cif &

so that a node extraction is performed on the circuit. On
large files, it is extremely useful to run this program in
the background mode as shown by the & in this command. A
large CIF file such as the one for the 16-bit multiplier can
take up to thirty minutes of CPU time to run. When the UNIX
system is heavily loaded, this requires eight to ten hours
of real time. The output files are directly compatible with
the CAD simulation tools to be used.

1. Logical Simulation

The first step in any design validation process is
to determine if the circuit functions as it was designed to.
Today, as the complexity of VLSI designs increases, the
number of possible inputs goes up tremendously. For
example, to exhaustively test just the normal operation of
the 16-bit multiplier would require each possible
combination of the 16-bit multiplier and multiplicand inputs. The
number of possible combinations of the vectors a and b is

(2^{16})^2 = 2^{32} = 4,294,967,296.

The ESIM logic simulator is the CAD tool to be used
for checking operation of the 16-bit multiplier. If a
vector pair is input only once, without regard to order, and
at an estimated rate of two test vector pairs simulated per
minute, this would require

4,294,967,296 vectors x 1 day/2880 tests = 1.49 x 10^6 days.

This amounts to over 4085 years required to perform an
exhaustive test.
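
The arithmetic behind this estimate can be reproduced directly. The sketch below assumes a 365-day year and the stated rate of two vector pairs per minute.

```python
pairs = (2 ** 16) ** 2           # every multiplier/multiplicand combination
tests_per_day = 2 * 60 * 24      # two vector pairs simulated per minute
days = pairs / tests_per_day
years = days / 365               # assumes a 365-day year

print(f"{pairs:,} pairs at {tests_per_day} per day")
print(f"{days:.3g} days, or just over {int(years)} years")
```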

Therefore,  seven  representative  pairs  of  test 
vectors  were  selected  for  simulation  to  determine  if  the 
circuit  operates  correctly.  Exhaustive  testing  is  not 
possible,  but  most  possible  errors  would  be  revealed  by 
these  few,  carefully  chosen  test  vectors.  These  seven  test 
vectors  are: 

1.  +143  X  +27 

2.  -143  X  +27 

3.  +143  X  -27 

4.  -143  X  -27

5.  +1123  X  +891 

6.  -1123  X  +891 

7.  -32768  X  -32768

These vectors were designed to test as large a number of
subcircuits as possible. The first four vector pairs test
the basic architecture for the correct implementation of the
algorithm represented by Equation 3.1. The positive/negative
and negative/negative test vector pairs also test
the CLA adder's ability to produce a proper sum over the
entire thirty-two bit width. The next two vector pairs test
the ability of the CSA in the Wallace tree reduction scheme
to produce a correct result in the upper sixteen bits of the
product. The last test vector is the largest negative
number representable in 16-bit two's complement form.
Further simulation with additional test vectors would
increase the confidence of the designer in the ability of
the circuit to properly perform a 16-bit two's complement
multiplication prior to fabrication.

Once the read-in of the .sim file by ESIM is
completed, the initialization of the circuit, the defining
of watched nodes, and describing the clock cycles must be
accomplished before any simulation is performed. Rather
than do this each time ESIM is entered, a macro file was
created that is called at the beginning of each session.
This file is called init_esim and is shown in Figure 3.10
for the 16-bit multiplier. The input vectors for the two
operands are represented as ain and bin. The resultant
product vector is shown as phigh and plow representing the
upper and lower 16 bits of the 32-bit product, respectively.
The latch input and output signals are represented as the
vectors latchin and latchout where the leftmost bit
corresponds to the first latch and the rightmost bit to the
fourth latch.

After initialization of the circuit by executing the
init_esim macro, at each clock cycle the seven test vector
pairs previously defined are input in sequential order. In
each case, on the fifth clock cycle after introduction of a
test vector, the correct product appeared at the output pads
phigh and plow. This demonstrates that the circuit can
properly multiply two 16-bit two's complement operands to
yield a 32-bit result with the result dependent only on the
inputs to the circuit five clock cycles prior. The results
of this logic simulation are contained in Appendix B.

Figure 3.10 Initialization Macro for ESIM.
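
The values expected at phigh and plow for a given operand pair can be worked out ahead of a simulation run. The helper below is a minimal sketch with a hypothetical name; it assumes the product is a 32-bit two's complement word split into upper and lower 16-bit halves, and it shows three of the test pairs.

```python
def expected_outputs(a, b):
    """Return (phigh, plow) for signed 16-bit operands a and b."""
    for operand in (a, b):
        if not -2**15 <= operand < 2**15:
            raise ValueError("operand does not fit in 16 bits")
    product = (a * b) & 0xFFFFFFFF     # wrap to 32-bit two's complement
    return product >> 16, product & 0xFFFF

for a, b in [(143, 27), (-143, 27), (1123, 891)]:
    phigh, plow = expected_outputs(a, b)
    print(f"{a:6d} x {b:4d} -> phigh={phigh:04X} plow={plow:04X}")
```

The negative product fills phigh with sign bits, while 1123 x 891 is large enough to produce a nonzero upper half, which is why such pairs exercise the full 32-bit width.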

The serial shifting of the latches was simulated and
used to generate the intermediate results discussed in the
next chapter. This also proved to logically operate as
expected, thus giving the designer a high degree of
confidence that the circuit operates as desired.

2. Timing

The CRYSTAL VLSI timing analyzer is used to test for
the worst case propagation delay in the circuit. Each phase
of the clock in both a normal and shifting operation is
checked for a critical path that is defined to be within one
percent of the worst case propagation delay. These critical
paths determine the maximum clock speed at which the circuit
can properly operate. The worst delays found are discussed
for each phase of the clock.

On the rising edge of an externally applied phi1,
the longest propagation delay occurs from the input pads
until the data is stored in the first inverter of the stage
1 latch. This delay is found to be 558.82 nanoseconds.
This long delay can be attributed to the two high fanouts
that occur in the data path of the first stage. The first
is a fanout of sixteen that occurs at each input pad to the
input of the sixteen NAND gates used as 1x1 multipliers.
The second is a fanout of fourteen that occurs at the end of
the first stage where the full adder cells that correspond
to the extended sign bits are distributed to drive full
adders in the second stage.

When phi1 falls, it takes 89.11 nanoseconds for the
latch cells to turn off their input pass transistors and
isolate the data so it may be transferred during phi2. This
fall time corresponds to the separation time between phi1
and phi2 when both clock phases are low.

Once a rising clock edge is applied to phi2, it
takes 98.26 nanoseconds for the pass transistors in the
latch cells to turn on and charge the second inverter. To
complete the transfer of data, these pass transistors must
be disabled by the falling of phi2. This corresponds to the
minimum separation between the phi2 and phi1 clock phases
and is found to be 64.28 nanoseconds.

Figure 3.11 depicts the minimum clock cycle for the
16-bit multiplier as determined by CRYSTAL. This equates to
a maximum overall clock frequency of 1.234 MHz. The results
of the CRYSTAL timing analysis are contained in Appendix E.
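
The clock frequency follows from the four delays quoted above, since the minimum period is their sum. The short calculation below uses variable names chosen for this sketch; the five-cycle latency figure is derived here from the pipeline depth and is not a number reported by CRYSTAL.

```python
# Worst case delays reported by CRYSTAL, in nanoseconds.
phi1_high = 558.82      # input pads to the first inverter of the stage 1 latch
phi1_to_phi2 = 89.11    # both phases low after phi1 falls
phi2_high = 98.26       # transfer to the second inverter
phi2_to_phi1 = 64.28    # both phases low after phi2 falls

period_ns = phi1_high + phi1_to_phi2 + phi2_high + phi2_to_phi1
f_max_mhz = 1e3 / period_ns
latency_us = 5 * period_ns / 1e3    # five clock cycles through the pipeline

print(f"period  = {period_ns:.2f} ns")
print(f"f_max   = {f_max_mhz:.3f} MHz")
print(f"latency = {latency_us:.2f} us")
```

The phi1 high time accounts for roughly two-thirds of the period, so the first-stage fanout is the natural place to look for speed improvements.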

3. Power Consumption

DC power requirements for the 16-bit multiplier are
determined through the use of the CAD program POWEST.
POWEST looks for pullup transistors and determines a total
count of these devices. Using a reference power consumption
for pullup transistors of certain sizes and types, it
obtains a maximum estimate of power consumed by assuming all
pullups are on at the same time. The average power
consumption is determined by assuming that only half of the pullups
are on at a given time.

Figure 3.11 Minimum Clock Cycle Parameters

For the 16-bit multiplier, the maximum DC power
consumption is found to be 3.177 watts with an average power
consumed of 1.983 watts. The results of the POWEST simula-


IV. TEST PLAN


As stated earlier, the use of the logic simulator ESIM,
the CRYSTAL timing analyzer, and POWEST will give the
designer a high degree of confidence that the circuit
designed will perform as desired. Once the circuit has been
fabricated and received from MOSIS, it must be tested to
ensure that fabrication and/or bonding errors did not occur.
Preliminary work done by Carlson on a 16-bit pipelined
multiplier indicates that errors in fabrication and/or
bonding do actually occur. In this chapter, a test plan for
the verification of power consumption, correct logical
operation, and maximum speed of operation is presented.

A. IDENTIFYING INPUT AND OUTPUT PINS

After fabrication, the chip will come back packaged in
an 84 pin square grid package with 21 pins on each side.
Since only 77 pins are used in the 16-bit multiplier, it is
imperative that the pin to pad connections are accurately
known. To do this, one must properly orient the chip.
Close examination of the chip will reveal the logo "GO ARMY"
located between the GND and Vdd rails that run around the
perimeter of the chip. Place this logo in the southeast
corner as shown in Figure 4.1. Using this logo as a
landmark, proceed clockwise around the chip starting on the
southern edge.

Along the southern edge are twenty-one output pads that
are used for a portion of the product. Representing the
product as p31...p0 where p0 is the least significant bit,
the southern edge contains signals p6 through p26 as one
moves from east to west. The western edge is made up of
five output pads and twelve input pads. Moving from south
to north, the first five pads are p27 through p31. The next
pad is the phi2 clock input followed by the four latch
serial inputs for latch 4 through latch 1. Then comes the
Vdd pad followed by the six most significant bits of the
multiplier a15 through a10. Moving west to east along the
northern edge, the remainder of the multiplier inputs a9
through a0 and the eleven inputs of the multiplicand b15
through b5 are encountered. Along the eastern edge going
from north to south, the remainder of the multiplicand pads
b4 through b0 are found followed by the GND pad. Next are
the four latch serial outputs for latch 1 through latch 4.
Next are the OP and phi1 inputs which are followed by the
lower six bits of the product vector p0 through p5. This
should complete the circuit around the chip and leave one
back at the logo. Extreme care must be exercised when
tracing the fine wires from the bonding pads to the pins,
especially along the east and west edges where the number of
pins is greater than the number of bonding pads.
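
The pad order just described can be written down as a table and checked against the count of 77 used pins. The sketch below is a bookkeeping aid only; the dictionary keys and the latch_in/latch_out names are labels invented for this illustration.

```python
def sig(prefix, first, last):
    """Signal names from prefix<first> to prefix<last>, in either direction."""
    step = 1 if last >= first else -1
    return [f"{prefix}{i}" for i in range(first, last + step, step)]

# Pads listed clockwise, starting on the southern edge next to the logo.
pads = {
    "south, east to west": sig("p", 6, 26),
    "west, south to north": (sig("p", 27, 31) + ["phi2"]
                             + sig("latch_in", 4, 1) + ["Vdd"]
                             + sig("a", 15, 10)),
    "north, west to east": sig("a", 9, 0) + sig("b", 15, 5),
    "east, north to south": (sig("b", 4, 0) + ["GND"]
                             + sig("latch_out", 1, 4) + ["OP", "phi1"]
                             + sig("p", 0, 5)),
}

for edge, names in pads.items():
    print(f"{edge:22s} {len(names):2d} pads: {names[0]} ... {names[-1]}")

all_pads = [name for names in pads.values() for name in names]
print("total pads:", len(all_pads))
```

The south and north edges each carry 21 pads, matching the 21 pins per side, while the west and east edges carry 17 and 18, which is why the bond wires on those two edges need the closest inspection.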

To power the chip, +5 volts DC should be applied to the
Vdd pad and 0 volts to the GND pad. All inputs should use
Vdd to represent a logic 1 and GND for a logic 0. The
outputs use the same levels as the inputs to represent the
two logic levels. To measure the outputs, they should be
connected to a device with a high input impedance.
According to the documentation for the pads, the output pads
are designed to drive approximately two TTL loads, but may
require a pull up resistor to obtain a full Vdd output level.

B. POWER CONSUMPTION

The simplest of the three tests to perform is to check
the static DC power consumption of the circuit. Once input,
output, and supply pins are properly connected, this can be
accomplished by inserting a milliammeter into the Vdd supply
line and measuring the current the circuit is
drawing. This value multiplied by the +5 volts of the power
supply will give an approximate average DC power
consumption. This figure should be in the vicinity of the 1.983
watts predicted by POWEST.
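
It can help to know the meter reading to expect before making the measurement. The conversion below is a simple application of P = V x I to the POWEST average and maximum estimates from the previous chapter; the function name is chosen for this sketch.

```python
VDD_VOLTS = 5.0

def supply_current_ma(power_watts, vdd=VDD_VOLTS):
    """Supply current in milliamperes for a given DC power draw."""
    return 1e3 * power_watts / vdd

print(f"average: {supply_current_ma(1.983):.1f} mA")
print(f"maximum: {supply_current_ma(3.177):.1f} mA")
```

A reading near the average figure is the expected outcome, and a reading well above the maximum figure would suggest a fabrication or bonding fault.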

C. TESTING FOR LOGICAL OPERATION

Since exhaustive testing of the 16-bit multiplier is
virtually impossible, the same seven test vectors that were
used in ESIM should be utilized to verify correct operation.
In addition, other random vector pairs should be tested for
correct operation in the circuit. At this point, speed of
operation is not a concern and the clock frequency should be
reduced by approximately an order of magnitude from that
predicted by CRYSTAL. This will ensure that propagation
delays do not become a factor in determining logical
correctness.

First, the vector pairs should be applied one at a time
and a minimum of five clock cycles completed with OP at a
logic 1. At the end of the fifth clock cycle, the output
should represent the correct product for the input pair.
This will at least ensure that the chip performs a 16-bit
two's complement multiplication. This should be done for
each of the seven test vector pairs that were used in ESIM.
Next, each of the seven test vector pairs should be applied
every clock cycle. After a delay of five clock cycles, the
correct results should appear at the output during phi2 of
each cycle of the clock. This establishes the fact that the
chip can multiply in a pipelined manner.

To determine if the latches can serially operate as
designed, known sequences should be applied at the inputs
with the OP pin at a logic 0. Since the latches that are
output to the four latch output pads are all of different
lengths, the output of this operation will occur at
different times for each pin. For latch 1, latch 2, latch 3
and latch 4, the input sequence will start appearing at the
appropriate output pin after 143, 57, 70 and 64 clock
cycles, respectively.

If  any  of  the  test  vectors  fail,  the  intermediate  latch 
results  of  each  vector  pair  can  be  shifted  to  an  output  pin 
for examination. This can provide an excellent aid in
locating  circuit  faults.  The  intermediate  latch  values  and 
the  final  product  outputs  for  each  of  the  seven  test  vector 
pairs  are  found  in  Appendix  C. 

D. TESTING FOR MAXIMUM SPEED

The third and final test to be performed on the chips
that pass the logic function testing is to determine the
maximum frequency at which they will operate correctly. To
accomplish this, the duration of the time that phi1 and phi2
are high and the two interphase times when phi1 and phi2 are
low should be separately reduced until an incorrect product
is generated. This should be done with each of the seven
test vectors until a minimum time is found for each of these
four clock parameters. Then the worst case for each of
these parameters over all seven test vectors can be called
the minimum clock parameters for the 16-bit multiplier. The
maximum overall clock frequency for the chip is then just
the reciprocal of the sum of the four minimum clock
parameters.


V. CUSTOM VS. SILICON COMPILER DESIGN


One of the main advantages of using a silicon compiler
is that it provides an extremely fast transition time from
the initial architecture to the final layout of the design.
This author estimates that the total time to actually
generate the design of the 8-bit multiplier by Carlson
[Ref. 2] using the MacPitts silicon compiler was less than
24 man-hours. Theoretically, at the end of this time, a
functionally correct layout is generated. Later work done
by Froede [Ref. 11] on this compiler has proven that
MacPitts does not always generate a correct layout. In
comparison, the time consumed in the design of the 16-bit
multiplier presented in this thesis is estimated at over 750
man-hours.

This design turn-around time advantage of using a
silicon compiler for chip generation allows the designer a
great degree of freedom to explore possible different
architectures to solve a problem and actually see the results in
silicon. This freedom is not enjoyed by the full custom
designer whose architecture must be thoroughly researched
and optimized prior to the layout of the actual chip. If
this is not the case, a tremendous loss of valuable
man-hours occurs when the redesign of a chip's basic
architecture must be undertaken.

The use of a silicon compiler is not without its
disadvantages though. Three of the main areas in which a silicon
compiler generated chip is at a disadvantage are:

1. density of transistors.

2. speed of operation.

3. power consumption per transistor.

To make a specific comparison, an 8-bit multiplier
generated by the MacPitts silicon compiler available at NPS
was compared with the full custom multiplier of this thesis.
The following sections discuss the three main areas listed
above. They are preceded by a discussion of the two
circuit architectures that are to be compared.

A. FUNCTIONAL ARCHITECTURE

The  architecture  of  the  16-bit  multiplier  has  already 
been  thoroughly  presented  in  the  previous  two  chapters.  In 
summary,  the  chip  performs  a  16-bit  two's  complement  pipe¬ 
lined  multiplication  on  16-bit  operands  with  a  latency  of 
five  cycles  of  a  two  phase  clock.  The  circuitry  for  this 
chip is designed using a minimum feature size of 3.0 microns
and  is  wholly  contained  on  one  integrated  circuit. 

The multiplier generated by the MacPitts silicon
compiler performs an 8-bit multiplication on unsigned 8-bit
operands with a latency of eight cycles of a three phase,
five segment clock. It uses the basic add-and-shift
algorithm for the basis of its architecture. Due to the
limitations in chip dimensions, pin count, and minimum feature
size imposed by MOSIS at the time the chip was fabricated,
this chip was designed with a minimum feature size of 4.0
microns. It requires the cascading of two identical
integrated circuits to perform an 8-bit multiplication.

Additionally, the 16-bit multiplier employs an LSSD
technique that allows the contents of each of the four
intermediate latches to be serially examined to aid in the
detection of circuit fabrication errors. The MacPitts
multiplier does not employ this technique and determining
fabrication and/or design errors is extremely difficult, if
not impossible, to perform by examining just the chip
outputs. An LSSD technique could possibly have been included
in the MacPitts design, but if included the maximum chip
area defined by MOSIS may have been exceeded.

B. CHIP AREA AND DENSITY

Since the two VLSI circuits are designed with different
minimum feature sizes, to provide a fair basis for comparison
of the two designs the 16-bit multiplier is normalized to a
4.0 micron feature size. Figure 5.1 shows the resultant
.log file from the MEXTRA node extractor for both the 8-bit
and 16-bit multipliers. This file contains the chip
dimensions in centimicrons and the number of transistors in the
circuit.


Window: 0 676600 0 602400
801 depletion
1612 enhancement
1398 nodes

MacPitts 8-bit Multiplier.


Window: -600 919350 -600 789300
3914 depletion
11962 enhancement
8503 nodes

Custom 16-bit Multiplier.


Figure 5.1 MEXTRA .log Output.

The size shown in Figure 5.1 for the 16-bit multiplier
is based on a lambda of 1.5 microns. This results in
chip dimensions of 9199.50 by 7899.0 microns. By current
MOSIS limitations, the maximum chip dimensions are 9200.0 by
7900.0 microns. Therefore, at lambda equal 1.5 microns the
overall design is within one micron or less of the maximum
allowed by MOSIS. Normalizing the circuit dimensions to a
4.0 micron minimum feature size, the 16-bit multiplier
consumes an area 12,260.0 by 10,532.0 microns. By
comparison, the MacPitts generated 8-bit multiplier occupies an
area 6766.0 by 6024.0 microns. The MacPitts chip consumes
approximately one-third of the area of the hand-crafted
multiplier.

The other main point of interest that deals with the physical
characteristics of the chip is its transistor density, or number
of transistors per square micron. For the normalized 16-bit
multiplier, Figure 5.1 shows a total of 15,876 transistors. This
yields a transistor density of 1.23 x 10^-4 transistors per
square micron. For the MacPitts multiplier, the MEXTRA node
extraction found a total of 2,413 transistors. This gives a
transistor density of 5.92 x 10^-5 transistors per square
micron. One interesting point to note is that the MacPitts
compiler found eighty-four more transistors on the 8-bit
multiplier than the MEXTRA node extractor did [Ref. 2]. One
possible explanation for this difference is that MacPitts
generates some unusual transistor structures that were
unrecognizable by MEXTRA.
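
The normalization and density figures above can be reproduced
with a few lines of arithmetic. The sketch below is illustrative
only: it assumes that the MEXTRA Window values are in hundredths
of a micron and that the custom chip's minimum feature size is
3.0 microns (twice the 1.5 micron lambda). The function names
are hypothetical and are not part of any CAD tool.

```python
# Illustrative check of the area and density figures in this section.
# Assumptions: Window values are in hundredths of a micron, and the
# custom design's feature size is 3.0 microns (2 x lambda of 1.5).

def window_size_microns(xmin, xmax, ymin, ymax):
    """Convert a MEXTRA window to (width, height) in microns."""
    return ((xmax - xmin) / 100.0, (ymax - ymin) / 100.0)

def normalize(size, from_feature, to_feature):
    """Scale linear dimensions from one feature size to another."""
    return tuple(d * to_feature / from_feature for d in size)

def density(transistors, size):
    """Transistors per square micron."""
    return transistors / (size[0] * size[1])

# Window and transistor counts as listed in Figure 5.1.
custom_size = window_size_microns(-600, 919350, -600, 789300)
custom_norm = normalize(custom_size, 3.0, 4.0)
macpitts_size = window_size_microns(0, 676600, 0, 602400)

custom_density = density(3914 + 11962, custom_norm)
macpitts_density = density(801 + 1612, macpitts_size)
area_ratio = (macpitts_size[0] * macpitts_size[1]) / (custom_norm[0] * custom_norm[1])

print(custom_size, custom_norm, macpitts_size)
print(custom_density, macpitts_density, area_ratio)
```

Under these assumptions the normalized custom chip measures
12266 by 10532 microns, and the MacPitts chip occupies a little
under one-third of that area.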

C.  POWER CONSUMPTION

As improved technology increases the number of transistors per
chip, the static DC power dissipation of a VLSI circuit is
becoming more and more important. For the purposes of providing
comparisons, the CAD program POWEST is used as the basis for
reference.


For the 16-bit multiplier, the average DC power consumption is
found to be 1.983 Watts with a maximum power usage of 3.177
Watts. Using POWEST on the 8-bit multiplier yielded an average
DC power consumption of 0.352 Watts and a maximum power usage of
0.667 Watts. Appendix B contains the results of the POWEST runs
on both of the designs. The MacPitts silicon compiler also
outputs its own estimate of the maximum power consumed by a
circuit. For the 8-bit multiplier, this value is 0.407 Watts.
This value is over thirty-five percent less than the POWEST
maximum value.

One possible way to compare the power consumption of the two
designs is to determine a power consumed per transistor figure.
Using the maximum POWEST values for both designs yields
2.00 x 10^-4 Watts per transistor for the 16-bit multiplier and
2.77 x 10^-4 Watts per transistor for the 8-bit multiplier. The
difference between these two figures can be primarily attributed
to the following. The MacPitts multiplier uses nine two-input
NAND gates to generate the full adders used in each stage. The
custom multiplier uses a selector adder composed primarily of
pass transistors, which consume no static DC power. This results
in an overall lower power consumption per transistor for the
16-bit multiplier when compared to the 8-bit multiplier.
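
As a quick check on the per-transistor figures, the sketch below
divides the maximum POWEST estimates by the MEXTRA transistor
counts from Figure 5.1. The function name is hypothetical; only
the input values come from the text.

```python
# Illustrative arithmetic only: maximum POWEST power divided by the
# MEXTRA transistor count (depletion + enhancement) for each design.

def watts_per_transistor(max_power_watts, transistors):
    return max_power_watts / transistors

custom = watts_per_transistor(3.177, 3914 + 11962)   # 16-bit custom design
macpitts = watts_per_transistor(0.667, 801 + 1612)   # 8-bit MacPitts design

# Relative difference, expressed both ways.
custom_is_lower_by = 1.0 - custom / macpitts
macpitts_is_higher_by = macpitts / custom - 1.0

print(custom, macpitts, custom_is_lower_by, macpitts_is_higher_by)
```

The two ratios differ because they use different baselines: the
custom figure is roughly 28 percent below the MacPitts figure,
while the MacPitts figure is roughly 38 percent above the custom
one.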


D.  SPEED OF OPERATION


As discussed earlier, CRYSTAL determined that the maximum clock
frequency for the 16-bit multiplier is 1.234 MHz. MacPitts
generated designs use a different clocking scheme than the two
phase, non-overlapping clock presented by Mead and Conway
[Ref. 1: p. 65]. MacPitts uses a three phase, five segment
overlapping clock to generate the control signals for each latch
in the pipeline. For a full discussion of the MacPitts clocking
scheme and how to use the CRYSTAL timing analyzer on a MacPitts
design, the reader is referred to work done by Froede [Ref. 11].
The timing analysis was performed on the MacPitts multiplier in
accordance with this document and the worst-case CRYSTAL timing
results are contained in Appendix B.

The overall minimum clock period for a MacPitts design is found
with CRYSTAL by adding the worst stage propagation delay that
occurs during the first two segments of the clock to the last
three clock segment delays. For the 8-bit multiplier, the
longest stage is the first. The critical path is found to run
from the input pads, through the Weinberger array, and then
through eight full adders cascaded in series to perform one
summation of the partial products in the add-and-shift
algorithm. This delay was found to be 4838.89 nanoseconds. The
sum of the individual times for the clock signals to travel from
the input pads to the latch cells during the last three segments
of the clock is 207.14 nanoseconds. This results in an overall
minimum clock period of 5046.03 nanoseconds and a maximum clock
frequency of 198.176 kHz. The high propagation time in the first
stage of the circuit is due primarily to three things. First,
high resistance polysilicon is utilized for the long data runs.
Second, no signals are buffered in any way to provide an
improved signal sourcing capability to help combat the high
fanouts and long data runs. Third, an 8-bit ripple carry adder
is utilized to sum two partial products in every stage of the
pipeline. Each 1-bit full adder in an 8-bit ripple carry adder
is composed of nine NAND gates. The carry in between each full
adder in the ripple carry adder is not routed directly, but is
routed over a long polysilicon wire, which also contributes to
the high critical path delay.
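
The clock period arithmetic described above can be written out
as a short sketch. The function names are hypothetical; the two
delay values and the 1234.0 kHz figure for the custom design are
taken from the text.

```python
# Illustrative arithmetic: minimum clock period and maximum frequency
# for the MacPitts 8-bit multiplier, from the CRYSTAL delays quoted above.

def min_clock_period_ns(worst_stage_delay_ns, clock_segment_delay_ns):
    """Worst stage delay plus the summed delays of the last three segments."""
    return worst_stage_delay_ns + clock_segment_delay_ns

def max_frequency_khz(period_ns):
    """Convert a period in nanoseconds to a frequency in kHz."""
    return 1.0e6 / period_ns

period_ns = min_clock_period_ns(4838.89, 207.14)
freq_khz = max_frequency_khz(period_ns)

# Ratio of the custom design's maximum frequency to the MacPitts value.
speed_ratio = 1234.0 / freq_khz

print(period_ns, freq_khz, speed_ratio)
```

The resulting ratio of a little over six is the basis for the
speed comparison in the summary.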


E.  SUMMARY


Table III summarizes the results for the comparison of the
hand-crafted design and its silicon compiler generated
counterpart. The results are as expected, with the custom design
having a six-fold increase in maximum speed, a power consumption
per transistor about twenty-eight percent lower (equivalently,
the MacPitts figure is about thirty-eight percent higher), and a
doubling of chip density over the MacPitts design. The true
advantage of the MacPitts silicon compiler is its ability to
provide extremely rapid design turnaround time versus a
hand-crafted design. As research continues into the area of
silicon compilation and improvements are made to existing
compilers, they may someday become the powerful and useful tool
that they have the potential to be.


TABLE III

Summary of Comparison Statistics

PARAMETER                         CUSTOM MULT      MACPITTS MULT

SIZE OF OPERAND INPUTS            16 bits          8 bits
DIMENSIONS (microns)              12266 x 10532    6766 x 6024
DENSITY (transistors/micron^2)    1.23 x 10^-4     5.92 x 10^-5
STATIC DC POWER (Watts)
  POWEST AVERAGE                  1.983            0.352
  POWEST MAXIMUM                  3.177            0.667
  MACPITTS MAXIMUM                NA               0.407
POWER/TRANSISTOR (Watts)          2.00 x 10^-4     2.766 x 10^-4
MAXIMUM FREQUENCY (kHz)           1234.0           198.176
DESIGN TIME (man-hours)           750              24

In this thesis, the application of carry-save addition to a
16-bit two's complement multiplication and its implementation as
a pipelined VLSI design have been presented. A comparison
between this hand-crafted design and an 8-bit unsigned
multiplier was developed. This comparison, coupled with the
experience gained in the actual design and computer simulation
of the multiplier, leads to the following conclusions and
recommendations.


A.  DESIGN OF THE MULTIPLIER

If the design of the multiplier were to be undertaken again,
three changes to the circuit would be desirable. First, the
incorporation of a static latch would be attempted, provided a
feasible design that would fit into the limited available chip
area could be developed. A static latch would ensure that data
remains valid and is not discharged from the inverter's gate
capacitance if too slow a clock is applied. Second, the high
fanout from the latch control drivers would be divided into a
tree structure. At its termination points would be smaller, more
efficient drivers that would drive a fanout not greater than
five. Third, the buffering of the high fanout sign extended bits
of the first stage and of the outputs of certain 1x1 multipliers
would be improved. Both of the last two improvements would be
directed at optimizing the maximum clock frequency of the
multiplier.
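
To give a feel for the size of such a driver tree, the sketch
below counts the drivers needed at each level when no driver may
feed more than five loads. This is a rough illustration only:
the load count of 204 is a hypothetical example, not a figure
for the latch control lines of this design, and the function
name is invented for the sketch.

```python
import math

def fanout_tree_levels(loads, max_fanout=5):
    """Return the driver count at each tree level, leaf level first.

    Each driver feeds at most max_fanout loads; levels are added until
    a single root driver remains.
    """
    levels = []
    n = loads
    while n > 1:
        n = math.ceil(n / max_fanout)
        levels.append(n)
    return levels

# Hypothetical example: 204 loads on one control signal.
levels = fanout_tree_levels(204)
print(levels)            # drivers per level, leaves to root
print(sum(levels))       # total number of drivers in the tree
```

A tree of this kind trades extra driver area and a few added
gate delays for a much smaller capacitive load on each driver.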

Another possible solution to the long propagation delay through
the first stage is to partition the stage into two stages with
approximately equal delay. Although this would reduce the
propagation delay through the first stage, the increase in
routing complexity and the area required for an additional
204-bit latch may not be feasible within current MOSIS
limitations.

The LSSD technique is highly recommended for any pipelined
design so that the testing and detection of fabrication errors
is made easier. Not only will the LSSD technique prove
beneficial in after-fabrication testing, but it also proved
extremely useful in CAD simulation before fabrication to detect
routing errors. The value of implementing LSSD will in most
cases far outweigh the increased complexity of the latch design,
and it avoids the potential frustration of searching for errors
based only on final latch outputs.

A 32-bit CLA adder could be developed to complement the 16-bit
multiplier. This can be accomplished very rapidly and with
little additional effort by using the same method described in
this thesis, with the following exception. Since the carry in to
an adder is not necessarily zero, the equations actually input
to EQNTOTT and TPLA should be Equations 2.3 through 2.8 and
Equations 2.13 through 2.15. Additionally, the use of full
32-bit operands will require the expansion of all of the
latches.

B.  CAD HARDWARE AND SOFTWARE

The combination of EQNTOTT and TPLA proved to be a very useful
pair of CAD tools in the development of complex logic functions.
Additionally, TPLA appears extremely versatile given the
different technologies it supports and its numerous options.

CAESAR proved to be a very good design tool for the graphical
layout of a VLSI design. The installation of its successor, the
layout editor MAGIC, should greatly ease the routing burden of
the designer.


The coming addition of hardware to support actual testing of
chips that have been fabricated by MOSIS will greatly aid in
determining the accuracy of available CAD simulation tools. Once
these in-house testing capabilities are available, extensive
testing should be performed on the two multipliers discussed
here. In particular, a detailed comparison should be made
between CAD simulation and actual results in the areas of
functional operation, maximum speed, and static DC power
consumption.

C.  SILICON COMPILATION

Even though the MacPitts program available at NPS by no means
provides an optimum integrated circuit design, it is an
excellent vehicle from which to study the area of silicon
compilers. Silicon compilers provide an excellent alternative to
the custom, gate array, and standard cell interconnection
methods that are in use today. Further research into optimizing
the existing MacPitts silicon compiler for speed, power
consumption, and transistor density should be undertaken.

APPENDIX A

STIPPLE  PLOTS 


On the following pages are the stipple plots of the four basic
building blocks that were used in the design of the 16-bit
multiplier. Following these is a stipple plot of the final
layout for the 16-bit two's complement multiplier that was
designed for this thesis. For the purpose of clarity and
continuity, a stipple plot of the 8-bit multiplier generated by
the MacPitts silicon compiler is also presented. All plots were
made with the CAD program CIFPLOT.


Hand-crafted  16-Bit  Multiplier 


MacPitts 8-Bit Multiplier.


APPENDIX B

SIMULATION RESULTS


The following pages in this appendix contain, in order, the
resultant ESIM and CRYSTAL sessions for the 16-bit multiplier,
the CRYSTAL timing analysis for the 8-bit multiplier, and the
POWEST estimates for both the 16-bit and 8-bit multipliers.


ESIM results for 16-bit two's complement multiplier


% esim mult32.sim
11962 transistors, 8452 nodes (3914 pulled up)

sim> @ inii_Miin
initialization took 33772 steps
initialization took 4682 steps
initialization took 230 steps
initialization took 0 steps
initialization took 0 steps
step took 6 events
latchout=0000 0
plow=1111111111111111 65535
phigh=1111111111111111 65535
latchin=0000 0
bin=0000000000000000 0
ain=0000000000000000 0
op=1

sim> R 5
sim> V
latchout=0000 0
plow=0000000000000000 0
phigh=0000000000000000 0
latchin=0000 0
bin=0000000000000000 0
ain=0000000000000000 0
op=1
h inputs: Vdd op
l inputs: GND phi1 phi2

sim> @ test_vector1
step took 451 events
latchout=0000 0
plow=0000000000000000 0
phigh=0000000000000000 0
latchin=0000 0
bin=0000000010001111 143
ain=0000000000011011 27
op=1
sim> c
latchout=0000 0
plow=0000000000000000 0
phigh=0000000000000000 0
latchin=0000 0
bin=0000000010001111 143
ain=0000000000011011 27
op=1
cycle took 3785 events

sim> @ test_vector2
step took 1927 events
latchout=0000 0
plow=0000000000000000 0
phigh=0000000000000000 0
latchin=0000 0
bin=1111111101110001 65393
ain=0000000000011011 27
op=1
sim> c
latchout=0000 0
plow=0000000000000000 0
phigh=0000000000000000 0
latchin=0000 0
bin=1111111101110001 65393
ain=0000000000011011 27
op=1
cycle took 4888 events

sim> @ test_vector3
step took 2819 events
latchout=0000 0
plow=0000000000000000 0
phigh=0000000000000000 0
latchin=0000 0
bin=0000000010001111 143
ain=1111111111100101 65509
op=1
sim> c
latchout=0000 0
plow=0000000000000000 0
phigh=0000000000000000 0
latchin=0000 0
bin=0000000010001111 143
ain=1111111111100101 65509
op=1
cycle took 5243 events

sim> @ test_vector4
step took 4777 events
latchout=0000 0
plow=0000000000000000 0
phigh=0000000000000000 0
latchin=0000 0
bin=1111111101110001 65393
ain=1111111111100101 65509
op=1
sim> c
latchout=0000 0
plow=0000000000000000 0
phigh=0000000000000000 0
latchin=0000 0
bin=1111111101110001 65393
ain=1111111111100101 65509
op=1
cycle took 4821 events

sim> @ test_vector5
step took 3403 events
latchout=0000 0
plow=0000000000000000 0
phigh=0000000000000000 0
latchin=0000 0
bin=0000010001100011 1123
ain=0000001101111011 891
op=1
sim> c
latchout=0000 0
plow=0000111100010101 3861
phigh=0000000000000000 0
latchin=0000 0
bin=0000010001100011 1123
ain=0000001101111011 891
op=1
cycle took 5981 events

sim> @ test_vector6
step took 2121 events
latchout=0000 0
plow=0000111100010101 3861
phigh=0000000000000000 0
latchin=0000 0
bin=0000001101111011 891
ain=1111101110011101 64413
op=1
sim> c
latchout=0000 0
plow=1111000011101011 61675
phigh=1111111111111111 65535
latchin=0000 0
bin=0000001101111011 891
ain=1111101110011101 64413
op=1
cycle took 5341 events

sim> @ test_vector7
step took 1708 events
latchout=0000 0
plow=1111000011101011 61675
phigh=1111111111111111 65535
latchin=0000 0
bin=1000000000000000 32768
ain=1000000000000000 32768
op=1
sim> c
latchout=0000 0
plow=1111000011101011 61675
phigh=1111111111111111 65535
latchin=0000 0
bin=1000000000000000 32768
ain=1000000000000000 32768
op=1
cycle took 5084 events
sim> c
latchout=0000 0
plow=0000111100010101 3861
phigh=0000000000000000 0
latchin=0000 0
bin=1000000000000000 32768
ain=1000000000000000 32768
op=1
cycle took 4786 events
sim> c
latchout=0000 0
plow=0100010010010001 17553
phigh=0000000000001111 15
latchin=0000 0
bin=1000000000000000 32768
ain=1000000000000000 32768
op=1
cycle took 4170 events
sim> c
latchout=0000 0
plow=1011101101101111 47983
phigh=1111111111110000 65520
latchin=0000 0
bin=1000000000000000 32768
ain=1000000000000000 32768
op=1
cycle took 4280 events
sim> c
latchout=0000 0
plow=0000000000000000 0
phigh=0100000000000000 16384
latchin=0000 0
bin=1000000000000000 32768
ain=1000000000000000 32768
op=1
cycle took 3953 events
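
The products that appear on plow and phigh in the session above
can be checked against a simple software model. The sketch below
is an illustration, not part of the original simulation: it
treats ain and bin as 16-bit two's complement words, forms the
32-bit product, and splits it into high and low 16-bit halves.
The helper names are hypothetical, and the operand and result
lists are transcribed from the log in the order the test vectors
were applied.

```python
def to_signed16(word):
    """Interpret a 16-bit unsigned word as a two's complement value."""
    return word - 0x10000 if word & 0x8000 else word

def product_words(a, b):
    """Return (phigh, plow) for the 32-bit two's complement product."""
    product = (to_signed16(a) * to_signed16(b)) & 0xFFFFFFFF
    return product >> 16, product & 0xFFFF

# (ain, bin) for test vectors 1 through 7, as unsigned decimal values.
vectors = [
    (27, 143), (27, 65393), (65509, 143), (65509, 65393),
    (891, 1123), (64413, 891), (32768, 32768),
]

# (phigh, plow) pairs in the order they emerge from the pipeline.
observed = [
    (0, 3861), (65535, 61675), (65535, 61675), (0, 3861),
    (15, 17553), (65520, 47983), (16384, 0),
]

computed = [product_words(a, b) for a, b in vectors]
for (a, b), result in zip(vectors, computed):
    print(to_signed16(a), "x", to_signed16(b), "->", result)
```

In the log, the first non-zero product appears only after the
cycle that follows test vector 5, which is consistent with the
operands passing through the pipeline latches before reaching
the outputs.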


CRYSTAL results for 16-bit two's complement multiplier

Crystal, v.2
: build mult32.sim
[1:12.1u 0:12.4s 1786k]
: inputs a<15:0> b<15:0> op phi1 phi2
[0:00.1u 0:00.1s 1795k]
: inputs l1_in l2_in l3_in l4_in
[0:00.0u 0:00.0s 1795k]
: outputs p<31:0> l1_out l2_out l3_out l4_out
[0:00.0u 0:00.0s 1795k]
: markdynamic phi1 0 phi2 0
Marking transistor flow...
Setting Vdd to 1...
Setting GND to 0...
[0:08.1u 0:01.1s 1795k]

*** RISETIME FOR PHI2 IN NORMAL OP ***

: set 1 op
[0:00.5u 0:00.1s 1795k]
: set 0 phi1
[0:00.7u 0:00.1s 1795k]
: delay phi2 0 -1
(12279 stages examined.)
[0:46.8u 0:04.6s 1855k]
: critical 1m

Node  14171  is  driven  high  at  98.26ns 

...through  fet  at  (2772,  1751)  to  Vdd  after 
16259  is  driven  low  at  95.79ns 
...through  fet  at  (2792,  1810)  to  GND  after 
16968  is  driven  high  at  92.08ns 
...through  fet  at  (2800,  1819)  to  17829 
...through  fet  at  (2794,  1823)  to  Vdd  after 
1273  is  driven  high  at  89.S6ns 
...through  fet  at  (313,  1486)  to  Vdd  after 
11735  is  driven  high  at  35.90ns 
...through  fet  at  (303,  1506)  to  Vdd  after 
11765  is  driven  high  at  14.17ns 
...through  fet  at  (287,  1506)  to  Vdd  after 
11745  is  driven  low  at  10.03ns 
...through  fet  at  (285,  1422)  to  GND  after 
11764  is  driven  high  at  5.79ns 
...through  fet  at  (160,  1582)  to  Vdd  after 
12847  is  driven  low  at  0.11ns 
...through  fet  at  (156,  1604)  to  GND  after 
phi2  is  driven  high  at  0.00ns 
[0:00.3u 0:00.1s 1855k]

*** FALLTIME FOR PHI2 IN NORMAL OP ***

: clear
[0:00.9u 0:00.3s 1855k]
: set 1 op
Marking transistor flow...
Setting Vdd to 1...
Setting GND to 0...
[0:06.4u 0:00.6s 1855k]
: set 0 phi1
[0:00.8u 0:00.1s 1855k]
: delay phi2 -1 0
(16400 stages examined.)
[0:58.8u 0:02.6s 1879k]
: critical 1m

Node  11983  is  driven  low  at  64.28ns 

...through  fet  at  (2836,  1550)  to  GND  after 
12776  is  driven  high  at  64.98ns 
...through  fet  at  (2842,  1602)  to  13219 
...through  fet  at  (2852,  1602)  to  13220 
...through fet at (2863, 1645) to Vdd after
12892 is driven high at 54.01ns
...through fet at (2840, 1645) to Vdd after
13081 is driven low at 53.08ns
...through fet at (2836, 1656) to GND after
14010 is driven high at 55.67ns
...through fet at (2756, 1696) to 14572
...through fet at (2772, 1696) to 14437
...through fet at (2782, 1751) to Vdd after
14171 is driven low at 35.54ns
...through fet at (2767, 1756) to GND after
16259 is driven high at 33.63ns
...through fet at (2794, 1800) to Vdd after
16908  is  driven  low  at  22.80ns 
...through  fet  at  (2800,  1819)  to  17829 
...through  fet  at  (2792,  1816)  to  GND  after 
1273  is  driven  high  at  21.54ns 
...through  fet  at  (313,  1486)  to  Vdd  after 
11735 is driven high at 13.39ns
...through fet at (293, 1506) to Vdd after
11765  is  driven  low  at  10.69ns 
...through  fet  at  (285,  1483)  to  GND  after 
11745  is  driven  high  at  7.19ns 
...through  fet  at  (287,  1410)  to  Vdd  after 
11764  is  driven  low  at  2.51ns 
...through  fet  at  (156,  1581)  to  GND  after 
12847  is  driven  high  at  0.56ns 
...through  fet  at  (163,  1604)  to  Vdd  after 
phi2  is  driven  low  at  0.00ns 
[0:00.3u 0:00.1s 1879k]

*** PHI1 RISETIME IN NORMAL OP ***

: clear
[0:00.9u 0:00.Ss 1879k]
: set 1 op
Marking transistor flow...
Setting Vdd to 1...
Setting GND to 0...
[0:06.5u 0:00.Ss 1879k]
: set 0 phi2
[0:00.2u 0:00.0s 1879k]
: delay phi1 0 -1
(5926 stages examined.)
[0:12.1u 0:00.6s 1879k]
: critical 1m

Node  17518  is  driven  high  at  108.62ns 

...through  fet  at  (2256,  1845)  to  18827  after 
normdrout  is  driven  high  at  101.60ns 
...through  fet  at  (4013,  1343)  to  Vdd  after 
10876  is  driven  high  at  48.60ns 
...through  fet  at  (4141,  1351)  to  Vdd  after 
11000  is  driven  high  at  26.81ns 
...through  fet  at  (4163,  1351)  to  Vdd  after 
11302  is  driven  low  at  22.55ns 
...through  fet  at  (4166,  1423)  to  GND  after 

11063  is  driven  high  at  17.47ns 
...through  fet  at  (4408,  1354)  to  Vdd  after 

11064 is driven low at 6.61ns
...through  fet  at  (4433,  1362)  to  11369 
...through  fet  at  (4433,  1366)  to  GND  after 

10622  is  driven  high  at  5.72ns 
...through  fet  at  (4483,  1305)  to  Vdd  after 
10603  is  driven  low  at  0.11ns 
...through  fet  at  (4498,  1281)  to  GND  after 
phil  is  driven  high  at  0.00ns 
[0:00.1u 0:00.1s 1879k]

*** PHI1 FALLTIME FOR NORMAL OP ***

: clear
[0:00.8u 0:00.3s 1879k]
: set 1 op
Marking transistor flow...
Setting Vdd to 1...
Setting GND to 0...
[0:06.2u 0:00.1s 1879k]
: set 0 phi2
[0:00.2u 0:00.0s 1879k]
: delay phi1 -1 0
(4092 stages examined.)
[0:10.4u 0:00.6s 1896k]
: critical 1m

Node  4673  is  driven  low  at  89.11ns 

...through  fet  at  (2091,  781)  to  GND  after 
4486 is driven high at 8S.87ns
...through  fet  at  (27S6,  842)  to  Vdd  after 
normdrout  is  driven  low  at  S9.96ns 
...through  fet  at  (4021,  1S51)  to  GND  after 
11059 is driven high at 32.15ns
...through fet at (4141, 1446) to Vdd after
11302 is driven high at 10.54ns
...through  fet  at  (416S,  1446)  to  Vdd  after 

11063  is  driven  low  at  5.91ns 

...through  fet  at  (4407,  1362)  to  GND  after 

11064  is  driven  high  at  S.13ns 
...through  fet  at  (4434,  1354)  to  Vdd  after 

10622 is driven low at 2.49ns
...through  fet  at  (4498,  1304)  to  GND  after 
10603  is  driven  high  at  0.56ns 
...through  fet  at  (4489,  1282)  to  Vdd  after 
phil  is  driven  low  at  0.00ns 
[0:00.2u 0:00.1s 1896k]

*** PHI1 RISETIME FOR SHIFT OP ***

: clear
[0:00.9u 0:00.3s 1896k]
: set 0 op
Marking transistor flow...
Setting Vdd to 1...
Setting GND to 0...
[0:06.6u 0:00.5s 1896k]
: set 0 phi2
[0:00.2u 0:00.0s 1896k]
: delay phi1 0 -1
(11989 stages examined.)
[0:42.1u 0:01.7s 1918k]
: critical 1m

Node  4354  is  driven  high  at  343.02ns 
...through  fet  at  (2743,  502)  to  3227 
...through  fet  at  (2734,  463)  to  Vdd  after 
shdrout  is  driven  high  at  45.78ns 
...through  fet  at  (4007,  1223)  to  Vdd  after 

10522  is  driven  low  at  29.07ns 

...through  fet  at  (4053,  1228)  to  GND  after 
10336  is  driven  high  at  27.68ns 
...through  fet  at  (4067,  1216)  to  Vdd  after 

10523  is  driven  low  at  24.23ns 

...through fet at (4070, 1266) to GND after
10589 is driven high at 19.49ns
...through  fet  at  (4407,  1334)  to  Vdd  after 
10633  is  driven  low  at  6.61ns 
...through  fet  at  (4433,  1327)  to  10631 
...through  fet  at  (4433,  1324)  to  GND  after 
10622  is  driven  high  at  5.72ns 
...through  fet  at  (4483,  1305)  to  Vdd  after 
10603 is driven low at 0.11ns
...through  fet  at  (4498,  1281)  to  GND  after 
phil  is  driven  high  at  0.00ns 
[0:00.1u 0:00.1s 1918k]

*** PHI1 FALLTIME FOR A SHIFT OP ***

: clear
[0:00.8u 0:00.3s 1918k]
: set 0 op
Marking transistor flow...
Setting Vdd to 1...
Setting GND to 0...
[0:06.4u 0:00.4s 1918k]
: set 0 phi2
[0:00.2u 0:00.0s 1918k]
: delay phi1 -1 0
(20633 stages examined.)
[1:22.2u 0:08.4s 1961k]
: critical 1m

Node  11983  is  driven  low  at  72.04ns 

...through  fet  at  (2836,  1550)  to  GND  after 
12776  is  driven  high  at  70.74ns 
...through  fet  at  (2842,  1602)  to  13219 
...through  fet  at  (2852,  1602)  to  13220 
...through  fet  at  (2863,  1645)  to  Vdd  after 
12892  is  driven  high  at  61.78ns 
...through  fet  at  (2840,  1645)  to  Vdd  after 
13081  is  driven  low  at  60.84ns 
...through  fet  at  (2836,  1656)  to  GND  after 
14010  is  driven  high  at  63.43ns 
...through  fet  at  (2756,  1696)  to  14572 
...through  fet  at  (2772,  1696)  to  14437 
...through  fet  at  (2782,  1751)  to  Vdd  after 
14171  is  driven  low  at  43.30ns 
...through  fet  at  (2767,  1756)  to  GND  after 
16259  is  driven  high  at  41.39ns 
...through  fet  at  (2794,  1800)  to  Vdd  after 
shdrout  is  driven  low  at  30.70ns 
...through fet at (4032, 1225) to GND after
10522 is driven high at 19.22ns
...through fet at (4045, 1289) to Vdd after
10523 is driven high at 10.01ns
...through  fet  at  (4067,  1289)  to  Vdd  after 

10589  is  driven  low  at  6.29ns 
...through  fet  at  (4406,  1327)  to  GND  after 
10633  is  driven  high  at  3.13ns 
...through  fet  at  (4434,  1334)  to  Vdd  after 
10622  is  driven  low  at  2.49ns 
...through  fet  at  (4498,  1304)  to  GND  after 
10603  is  driven  high  at  0.56ns 
...through  fet  at  (4489,  1282)  to  Vdd  after 
phil  is  driven  low  at  0.00ns 
[0:00.2u 0:00.2s 1961k]

*** INPUT PAD TO LATCH 1 DELAY ***

: clear
[0:00.9u 0:00.7s 1961k]
: set 1 op
Marking transistor flow...
Setting Vdd to 1...
Setting GND to 0...
[0:06.3u 0:00.3s 1961k]
: set 0 phi1 phi2
[0:00.9u 0:00.1s 1961k]
: delay a<15:0> 0 0
(43921 stages examined.)
[2:16.6u 0:09.2s 1961k]
: critical 1m

Node  19554  is  driven  high  at  558.82ns 

...through  fet  at  (1008,  2140)  to  Vdd  after 
19655 is driven low at 554.42ns
...through  fet  at  (980,  2145)  to  GND  after 
21705  is  driven  high  at  531.79ns 
...through  fet  at  (667,  2760)  to  27839 
...through  fet  at  (677,  2760)  to  27714 
...through  fet  at  (693,  2798)  to  Vdd  after 
27436  is  driven  low  at  485.22ns 
...through  fet  at  (698,  2808)  to  GND  after 
22366 is driven high at 473.40ns
...through  fet  at  (1823,  3125)  to  Vdd  after 
30352  is  driven  low  at  337.44ns 
...through  fet  at  (183..  3142)  to  GND  after 
30351  is  driven  high  at  332.90ns 
...through  fet  at  (1807,  3257)  to  33567 
...through  fet  at  (1817,  3257)  to  33568 
...through  fet  at  (1840,  3306)  to  Vdd  after 


33186 is driven low at 299.22ns
...through fet at (1818, 3293) to GND after
33391 is driven high at 298.15ns
...through fet at (1822, 3306) to Vdd after
30591 is driven low at 295.49ns
...through fet at (1955, 3577) to 38872
...through fet at (1955, 3580) to GND after
38615 is driven high at 241.93ns
...through fet at (1997, 3813) to Vdd after
40527 is driven low at 3.37ns
...through fet at (2011, 3839) to GND after
40457 is driven high at 2.61ns
...through fet at (2030, 3824) to Vdd after
40625 is driven low at 0.11ns
...through  fet  at  (2052,  3839)  to  GND  after 
a2  is  driven  high  at  0.00ns 

[0:00.2u 0:00.2s 1961k]
: q
[8:58.2u 0:49.0s 1961k] Crystal done.


CRYSTAL results for stage 1 for the MacPitts chip

: build stage1.sim
[0:12.4u 0:01.3s 247k]
: inputs in<27:1>
[0:00.0u 0:00.1s 256k]
: outputs a<24:1>

*** FIRST STAGE DELAY ***

: delay in<27:1> 0 0
Marking transistor flow...
Setting Vdd to 1...
Setting GND to 0...
(11559 stages examined.)
[0:22.7u 0:00.9s 411k]
: critical

Node 2195 is driven high at 4838.89ns

...through  fet  at  (565,  934)  to  Vdd  after 

2118 is driven low at 4831.44ns
...through  fet  at  (506,  926)  to  2127 
...through  fet  at  (506,  921)  to  GND  after 

2095  is  driven  high  at  4825.41ns 
...through  fet  at  (485,  928)  to  Vdd  after 
1867 is driven low at 4813.82ns
...through  fet  at  (423,  922)  to  2086 
...through  fet  at  (423,  917)  to  GND  after 
1805  is  driven  high  at  4783.75ns 
...through  fet  at  (669,  910)  to  a2 
...through  fet  at  (683,  910)  to  1944 
...through  fet  at  (620,  934)  to  Vdd  after 

2119  is  driven  low  at  4330.98ns 
...through  fet  at  (585,  924)  to  2103 
...through  fet  at  (585,  919)  to  GND  after 

2048  is  driven  high  at  4326.95ns 
...through  fet  at  (537,  930)  to  Vdd  after 
1933  is  driven  low  at  4314.44ns 
...through  fet  at  (645,  1000)  to  2790 
...through  fet  at  (645,  1005)  to  GND  after 
2730  is  driven  high  at  4306.41ns 
...through  fet  at  (537,  1010)  to  Vdd  after 
2798 is driven low at 4293.82ns
...through  fet  at  (506,  1006)  to  2807 
...through  fet  at  (506,  1001)  to  GND  after 
2775  is  driven  high  at  4287.69ns 
...through  fet  at  (485,  1008)  to  Vdd  after 
2551  is  driven  low  at  4275.76ns 
...through  fet  at  (423,  1002)  to  2766 
...through  fet  at  (423,  997)  to  GND  after 


2525 is driven high at 4243.64ns
...through fet at (669, 990) to a3
...through fet at (683, 990) to 2637
...through fet at (620, 1014) to Vdd after
2799 is driven low at 3741.79ns
...through fet at (585, 1004) to 2783
...through fet at (585, 999) to GND after
2624 is driven high at 3735.64ns
...through  fet  at  (652,  1074)  to  Vdd  after 
3236  is  driven  low  at  3712.11ns 
...through  fet  at  (423,  1082)  to  3449 
...through  fet  at  (423,  1077)  to  GND  after 
3210  is  driven  high  at  3680.28ns 
...through  fet  at  (669,  1070)  to  a4 
...through  fet  at  (683,  1070)  to  3318 
...through  fet  at  (620,  1094)  to  Vdd  after 
3482  is  driven  low  at  3186.59ns 
...through  fet  at  (585,  1084)  to  3466 
...through  fet  at  (585,  1079)  to  GND  after 
3411  is  driven  high  at  3182.56ns 
...through  fet  at  (537,  1090)  to  Vdd  after 
3307  is  driven  low  at  3170.04ns 
...through  fet  at  (645,  1160)  to  4149 
...through  fet  at  (645,  1165)  to  GND  after 
4087  is  driven  high  at  3162.01ns 
...through  fet  at  (537,  1170)  to  Vdd  after 

4157  is  driven  low  at  3149.43ns 
...through  fet  at  (506,  1166)  to  4166 
...through  fet  at  (506,  1161)  to  GND  after 

4133  is  driven  high  at  3143.25ns 
...through  fet  at  (485,  1168)  to  Vdd  after 
3907  is  driven  low  at  3131.21ns 
...through  fet  at  (423,  1162)  to  4124 
...through  fet  at  (423,  1157)  to  GND  after 
3881  is  driven  high  at  3098.30ns 
...through  fet  at  (669,  1150)  to  a5 
...through  fet  at  (683,  1150)  to  3990 
...through  fet  at  (620,  1174)  to  Vdd  after

4158  is  driven  low  at  2577.22ns 
...through  fet  at  (585,  1164)  to  4141 
...through  fet  at  (585,  1159)  to  GND  after 

3978  is  driven  high  at  2571.91ns 
...through  fet  at  (652,  1234)  to  Vdd  after 
4770  is  driven  low  at  2555.05ns 
...through  fet  at  (530,  1244)  to  4825 
...through  fet  at  (530,  1239)  to  GND  after 
4841  is  driven  high  at  2547.85ns 
...through  fet  at  (513,  1252)  to  Vdd  after
4818  is  driven  low  at  2532.70ns
...through  fet  at  (478,  1242)  to  4810 
...through  fet  at  (478,  1237)  to  GND  after
4568  is  driven  high  at  2501.SSns
...through  fet  at  (669,  1230)  to  a6 
...through  fet  at  (683,  1230)  to  4677 
...through  fet  at  (620,  1254)  to  Vdd  after 
4842  is  driven  low  at  1985.61ns 
...through  fet  at  (585,  1244)  to  4826 
...through  fet  at  (585,  1239)  to  GND  after 
4666  is  driven  high  at  1980.29ns 
...through  fet  at  (652,  1314)  to  Vdd  after 
5456  is  driven  low  at  1963.43ns 
...through  fet  at  (530,  1324)  to  5508 
...through  fet  at  (530,  1319)  to  GND  after

5526  is  driven  high  at  1956.28ns 
...through  fet  at  (513,  1332)  to  Vdd  after 

5501  is  driven  low  at  1941.04ns 
...through  fet  at  (478,  1322)  to  5493 
...through  fet  at  (478,  1317)  to  GND  after 
5248  is  driven  high  at  1909.46ns 
...through  fet  at  (669,  1310)  to  a7 
...through  fet  at  (683,  1310)  to  5363 
...through  fet  at  (620,  1334)  to  Vdd  after 

5527  is  driven  low  at  1388.69ns 
...through  fet  at  (585,  1324)  to  5509 
...through  fet  at  (585,  1319)  to  GND  after 

5346  is  driven  high  at  1383.38ns 
...through  fet  at  (652,  1394)  to  Vdd  after 
6129  is  driven  low  at  1366.51ns 
...through  fet  at  (530,  1404)  to  6181 
...through  fet  at  (530,  1399)  to  GND  after

6197  is  driven  high  at  1359.33ns 
...through  fet  at  (513,  1412)  to  Vdd  after 

6174  is  driven  low  at  1344.20ns 
...through  fet  at  (478,  1402)  to  6166 
...through  fet  at  (478,  1397)  to  GND  after 
5928  is  driven  high  at  1312.98ns 
...through  fet  at  (669,  1390)  to  a8
...through  fet  at  (683,  1390)  to  6056 
...through  fet  at  (620,  1414)  to  Vdd  after 

6198  is  driven  low  at  800.61ns 
...through  fet  at  (585,  1404)  to  6182 

...through  fet  at  (585,  1399)  to  GND  after 
6025  is  driven  high  at  794.45ns 
...through  fet  at  (652,  1474)  to  Vdd  after 
6637  is  driven  low  at  770.92ns 
...through  fet  at  (423,  1482)  to  6842 
...through  fet  at  (423,  1477)  to  GND  after
6611  is  driven  high  at  739.09ns
...through  fet  at  (669,  1470)  to  6644 
...through  fet  at  (683,  1470)  to  6720 
...through  fet  at  (620,  1494)  to  Vdd  after 
755  is  driven  high  at  219.87ns 
...through  fet  at  (634,  410)  to  Vdd  after 
1080  is  driven  low  at  134.69ns 
...through  fet  at  (2443,  2876)  to  GND  after 
7571  is  driven  high  at  10.74ns 
...through  fet  at  (2487,  2858)  to  Vdd  after 
in  16  is  driven  low  at  0.00ns 
[0:00.7u  0:00.4s  411k]



CRYSTAL  results  for  the  clock  inputs  to
the  registers  of  the  MacPitts  chip.

Crystal,  V.2
:  build  timing.sim
[0:13.9u  0:01.6s  258k]

:  inputs  phia  phib  phic
[0:00.0u  0:00.0s  267k]

***  PHASE  1  OF  5  ***

:  set  1  phia  phic
[0:00.1u  0:00.0s  267k]

:  delay  phib  0  -1
(604  stages  examined.)

[0:00.9u  0:00.1s  271k]

:  critical 

Node  6392  is  driven  low  at  87.36ns 

...through  fet  at  (2322,  1476)  to  6678 
...through  fet  at  (2314,  1472)  to  GND  after 
6391  is  driven  high  at  81.45ns 
...through  fet  at  (2290,  1485)  to  6679 
...through  fet  at  (2333,  1483)  to  Vdd  after 
588  is  driven  high  at  65.23ns 
...through  fet  at  (2316,  841)  to  Vdd  after 
490  is  driven  low  at  62.98ns 
...through  fet  at  (2314,  834)  to  GND  after 
28  is  driven  high  at  50.57ns 
...through  fet  at  (791,  149)  to  Vdd  after 
21  is  driven  low  at  0.80ns 
...through  fet  at  (817,  134)  to  GND  after 
phib  is  driven  high  at  0.00ns 
[0:00.1u  0:00.1s  271k] 

***  PHASE  2  OF  5  ***

:  clear 

[0:00.1u  0:00.0s  271k] 

:  set  1  phia 

Marking  transistor  flow... 

Setting  Vdd  to  1... 

Setting  GND  to  0... 

[0:00.6u  0:00.0s  271k]

:  delay  phib  -1  0
(28  stages  examined.)
[0:00.1u  0:00.0s  271k]

:  delay  phic  -1  0
(28  stages  examined.)
[0:00.1u  0:00.0s  271k]

:  critical

Node  590  is  driven  low  at  119.19ns

...through  fet  at  (2344,  833)  to  GND  after 
491  is  driven  high  at  113.28ns 
...through  fet  at  (2338,  813)  to  Vdd  after 
25  is  driven  low  at  84.73ns 
...through  fet  at  (651,  134)  to  GND  after 
19  is  driven  high  at  10.74ns 
...through  fet  at  (695,  148)  to  Vdd  after 
phic  is  driven  low  at  0.00ns 
[0:00.1u  0:00.0s  271k]


***  PHASE  3  OF  5  ***

:  clear

[0:00.1u  0:00.0s  271k]

:  set  0  phib  phic
[0:00.1u  0:00.0s  271k]

:  delay  phia  -1  0
(40  stages  examined.)

[0:00.1u  0:00.0s  272k]

:  critical 

Node  574  is  driven  high  at  61.22ns 

...through  fet  at  (2087,  841)  to  Vdd  after 
483  is  driven  low  at  59.11ns 
...through  fet  at  (2085,  834)  to  GND  after
353  is  driven  high  at  49.97ns
...through  fet  at  (2088,  802)  to  Vdd  after
31  is  driven  low  at  30.89ns 
...through  fet  at  (907,  134)  to  GND  after 
23  is  driven  high  at  10.74ns 
...through  fet  at  (951,  148)  to  Vdd  after 
phia  is  driven  low  at  0.00ns 
[0:00.1u  0:00.1s  272k]


***  PHASE  4  OF  5  ***

:  clear

[0:00.1u  0:00.0s  272k]

:  set  0  phib  phic
[0:00.1u  0:00.0s  272k]

:  delay  phia  0  -1
(40  stages  examined.)

[0:00.1u  0:00.0s  274k]
:  critical 

Node  574  is  driven  low  at  54.31ns 

...through  fet  at  (2095,  833)  to  GND  after 


483  is  driven  high  at  49.17ns 
...through  fet  at  (2089,  813)  to  Vdd  after 
353  is  driven  low  at  27.72ns 
...through  fet  at  (2082,  792)  to  GND  after 
31  is  driven  high  at  15.16ns 
...through  fet  at  (919,  149)  to  Vdd  after 
23  is  driven  low  at  0.80ns 
...through  fet  at  (945,  134)  to  GND  after 
phia  is  driven  high  at  0.00ns 
[0:00.1u  0:00.0s  274k]

***  PHASE  5  OF  5  ***

:  clear

[0:00.1u  0:00.0s  274k]

:  set  1  phia 

Marking  transistor  flow... 

Setting  Vdd  to  1... 

Setting  GND  to  0... 

[0:00.6u  0:00.1s  274k]

:  set  0  phib

[0:00.1u  0:00.0s  274k]

:  delay  phic  0  -1
(412  stages  examined.)

[0:00.5u  0:00.0s  281k]

:  critical 

Node  6674  is  driven  low  at  91.61ns 

...through  fet  at  (2136,  1472)  to  GND  after 
6384  is  driven  high  at  85.13ns 
...through  fet  at  (2116,  1476)  to  6673 
...through  fet  at  (2099,  1483)  to  Vdd  after 
578  is  driven  high  at  70.69ns 
...through  fet  at  (2130,  841)  to  Vdd  after 
485  is  driven  low  at  68.51ns 
...through  fet  at  (2128,  834)  to  GND  after 
25  is  driven  high  at  55.79ns
...through  fet  at  (663,  149)  to  Vdd  after 
19  is  driven  low  at  0.80ns 
...through  fet  at  (689,  134)  to  GND  after 
phic  is  driven  high  at  0.00ns 
[0:00.1u  0:00.0s  281k]

:  q 
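
Each critical listing above runs from the last node to switch back to the clock edge at 0.00ns, so the delay contributed by one stage is the difference between its time and the time on the next "is driven" line. The following Python sketch illustrates that bookkeeping using the PHASE 1 path as sample data. The function names and the regular expression are illustrative assumptions and are not part of Crystal.

```python
import re

# Sample lines copied from the PHASE 1 critical path listed above.
LISTING = """\
Node  6392  is  driven  low  at  87.36ns
...through  fet  at  (2322,  1476)  to  6678
...through  fet  at  (2314,  1472)  to  GND  after
6391  is  driven  high  at  81.45ns
...through  fet  at  (2290,  1485)  to  6679
...through  fet  at  (2333,  1483)  to  Vdd  after
588  is  driven  high  at  65.23ns
...through  fet  at  (2316,  841)  to  Vdd  after
490  is  driven  low  at  62.98ns
...through  fet  at  (2314,  834)  to  GND  after
28  is  driven  high  at  50.57ns
...through  fet  at  (791,  149)  to  Vdd  after
21  is  driven  low  at  0.80ns
...through  fet  at  (817,  134)  to  GND  after
phib  is  driven  high  at  0.00ns
"""

# Matches "<node> is driven high|low at <time>ns", with an optional "Node" prefix.
EVENT = re.compile(
    r"^(?:Node\s+)?(\S+)\s+is\s+driven\s+(high|low)\s+at\s+([\d.]+)ns"
)


def parse_events(text):
    """Return (node, direction, time_ns) tuples in listing order (latest first)."""
    events = []
    for line in text.splitlines():
        match = EVENT.match(line.strip())
        if match:
            events.append((match.group(1), match.group(2), float(match.group(3))))
    return events


def stage_delays(events):
    """Delay of each stage relative to the event listed after it."""
    delays = []
    for (node, _, time_ns), (_, _, earlier_ns) in zip(events, events[1:]):
        delays.append((node, round(time_ns - earlier_ns, 2)))
    return delays


if __name__ == "__main__":
    for node, delay in stage_delays(parse_events(LISTING)):
        print(f"{node:>6}  {delay:7.2f} ns")
```

Sorting the result by delay shows which node dominates a path; in the PHASE 1 sample the largest single step is the one ending at node 28.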


POWEST  Results  for  the  16-bit  Multiplier

%  powest  -p  <  mult32.sim

gamma=0.4V**.5,  tox=9e-08m,  u0=0.08m**2/V-s
vdd=5V,  vtd=-3.5V,  vte=0.8V,  vsb=2V

#devs   Pdc_avg (W)   Pdc_max (W)   type
0       0.000000      0.000000      enhancement pullups
3720    1.790881      2.793533      depletion pullups
194     0.191948      0.383896      special depletion pullups
3914    1.982829      3.177428      TOTAL

POWEST  Results  for  the  8-bit  Multiplier

%  powest  -p  <  multip8c4.sim

gamma=0.4V**.5,  tox=9e-08m,  u0=0.08m**2/V-s
vdd=5V,  vtd=-3.5V,  vte=0.8V,  vsb=2V

#devs   Pdc_avg (W)   Pdc_max (W)   type
0       0.000000      0.000000      enhancement pullups
690     0.140672      0.244640      depletion pullups
111     0.211404      0.422809      special depletion pullups
801     0.352076      0.667449      TOTAL
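
As a consistency check on the two POWEST listings, each TOTAL row should equal the sum of the pullup rows above it. The Python sketch below transcribes the figures from the listings and verifies this. The data layout, the function names, and the rounding tolerance are assumptions made for illustration and are not part of POWEST.

```python
# (number of devices, Pdc_avg in W, Pdc_max in W), transcribed from the listings.
POWEST = {
    "16-bit": {
        "rows": [
            (0, 0.000000, 0.000000),      # enhancement pullups
            (3720, 1.790881, 2.793533),   # depletion pullups
            (194, 0.191948, 0.383896),    # special depletion pullups
        ],
        "total": (3914, 1.982829, 3.177428),
    },
    "8-bit": {
        "rows": [
            (0, 0.000000, 0.000000),      # enhancement pullups
            (690, 0.140672, 0.244640),    # depletion pullups
            (111, 0.211404, 0.422809),    # special depletion pullups
        ],
        "total": (801, 0.352076, 0.667449),
    },
}


def totals_consistent(table, tol=5e-6):
    """True if the TOTAL row matches the column sums within a rounding tolerance."""
    devices = sum(row[0] for row in table["rows"])
    avg = sum(row[1] for row in table["rows"])
    peak = sum(row[2] for row in table["rows"])
    total = table["total"]
    return (
        devices == total[0]
        and abs(avg - total[1]) <= tol
        and abs(peak - total[2]) <= tol
    )


def avg_power_ratio(first, second):
    """Ratio of the average DC power totals of two listings."""
    return first["total"][1] / second["total"][1]


if __name__ == "__main__":
    for name, table in POWEST.items():
        print(name, "totals consistent:", totals_consistent(table))
    ratio = avg_power_ratio(POWEST["16-bit"], POWEST["8-bit"])
    print(f"16-bit / 8-bit average DC power: {ratio:.2f}")
```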

APPENDIX  C

TEST  VECTORS

This appendix contains the inputs, intermediate latch
values, and the final product output for each of the test
vector pairs described in Chapter 3. Each binary value is
represented as its hexadecimal equivalent. The inputs and
outputs are represented with their most significant
hexadecimal digit in the leftmost position. The intermediate
latch contents are represented in hexadecimal, with the Nth
bit shifted out of the latch placed to the left of the
bit previously shifted out. The latch at the end of
stage X is identified as LATCHX, where X goes from 1 to 4.
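
The relationship between the hexadecimal inputs and the 32-bit output can be checked independently of the chip. The Python sketch below interprets each 16-bit input as a two's complement number, multiplies, and formats the product as eight hexadecimal digits, using the input pairs and outputs listed for test vectors 4, 5 and 7. The function names are illustrative assumptions, and the sketch does not model the intermediate latch contents.

```python
def to_signed16(word):
    """Interpret a 16-bit word as a two's complement integer."""
    word &= 0xFFFF
    return word - 0x10000 if word & 0x8000 else word


def product_hex(a, b):
    """Return the 32-bit two's complement product of two 16-bit words as hex."""
    product = to_signed16(a) * to_signed16(b)
    return format(product & 0xFFFFFFFF, "08X")


# (input A, input B, listed OUTPUT) for test vectors 4, 5 and 7.
VECTORS = [
    (0xFFE5, 0xFF71, "00000F15"),
    (0x0463, 0x037B, "000F4491"),
    (0x8000, 0x8000, "40000000"),
]


def check_vectors(vectors):
    """Return True if every listed output equals the computed product."""
    return all(product_hex(a, b) == expected for a, b, expected in vectors)


if __name__ == "__main__":
    for a, b, expected in VECTORS:
        print(f"{a:04X} x {b:04X} -> {product_hex(a, b)} (listed {expected})")
```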


TEST  VECTOR  1

INPUTS:  001B  008F

OUTPUT:  00000F15

LATCH1:  0000000000000000000000000000011072E7

LATCH2:  00000000002A6E7

LATCH3:  000000000000153DC7

LATCH4:  0000000000CA9DC7


TEST  VECTOR  2

INPUTS:  FF71  001B

OUTPUT:  FFFFF0EB

LATCH1:  659659659659659659659659768C0D5A7295

LATCH2:  155555554AIE695

LATCH3:  2A9AA6A9AA15070015

LATCH4:  AA552595457B8D15


TEST  VECTOR  3

INPUTS:  008F  FFE5

OUTPUT:  FFFFF0EB

LATCH1:  4104104104104010406016C90B6062250A53

LATCH2:  155555555530653

LATCH3:  2A9AA6A9AA2A580C93

LATCH4:  AA552A954ASC0C93


TEST  VECTOR  4

INPUTS:  FFE5  FF71

OUTPUT:  00000F15

LATCH1:  4F3CF3CF3CI7CA38E76aB6C85B49E01EB429

LATCH2:  0AAB34D5562A829

LATCH3:  15559699AAAC354049

LATCH4:  55ACE9B55E0AA049


TEST  VECTOR  5

INPUTS:  0463  037B

OUTPUT:  000F4491

LATCH1:  00000000000000000020800884A01F32641F

LATCH2:  00000014A1C641F

LATCH3:  00000001A50383083F

LATCH4:  00000014A0E1833F


TEST  VECTOR  6

INPUTS:  037B  FB9D

OUTPUT:  FFF0BB6F

LATCH1:  4104104104C045065324551459A2330F169B

LATCH2:  15552C756994A9B

LATCH3:  2A9A918FAB130A4513

LATCH4:  AA5491F564C5251B


TEST  VECTOR  7

INPUTS:  8000  8000

OUTPUT:  40000000

LATCH1:  000410410410410410410410412000000000

LATCH2:  0155555558C0000

LATCH3:  029AA6A9AAE0000000

LATCH4:  0AD56AB55CC00000


_  _  oi  Diaitaj.  Pipe 

Thesis,  Naval  Postgraduate  School,
California,  June  1984.